## Summary (Overview)
- New reranking paradigm (FBNL): KaLM-Reranker-V1 introduces a fast but not late-interaction (FBNL) design based on an encoder–decoder architecture. The encoder pre-encodes passages offline (reusable across queries), while the decoder models query intent and computes fine-grained relevance via cross-attention. This decouples query–passage computation without sacrificing expressive interaction.
- Matryoshka Embedding Pooling (MEP): Passage representations are compressed along the sequence dimension by a factor \(r \in \{1,2,4,8,16,32\}\) via mean pooling, enabling flexible trade-offs between storage cost and reranking quality. At moderate compression (\(r=2\)–\(8\)), performance degradation is negligible.
- Strong performance with superior efficiency: On BEIR, KaLM-Reranker-V1-Large (4B) achieves an average nDCG@10 of 62.87, on par with Qwen3-Reranker-4B (63.50) but at ~5.5× lower online computation cost. On MIRACL (18 languages), it outperforms bge-reranker-v2-gemma (2.5B) while being nearly 2× more efficient. On LMEB, even the 0.27B Nano model competes with 7–12B embedding models.
- Multi-stage training: A progressive three-stage pipeline (general reranking → task-specific adaptation → fine-grained distillation) improves generalization and discriminative ability.
- In-depth analysis: ROC-AUC analysis confirms that moderate compression preserves discriminative power; larger models are more robust to compression. Efficiency gains reach up to 203× for long passages (\(n=4096\)) and 33× at \(r=8\).
## Introduction and Theoretical Foundation
Background. Neural retrieval systems typically use a two-stage pipeline: first-stage retrieval recalls a small set of candidates, and a reranking stage performs finer-grained relevance modeling. Reranking quality is critical for search, recommendation, and retrieval-augmented generation (RAG).
Limitations of existing rerankers. Most current rerankers (encoder-based or decoder-based) jointly encode query and passage, tightly coupling their computation. This forces each query–passage pair to be recomputed online, making pre-computation impossible and scaling costly. Late-interaction methods (e.g., ColBERT) decouple query and passage encoding but limit interaction to similarity operations over independently contextualized tokens, missing deep relevance signals.
Proposed solution. The paper introduces the fast but not late-interaction (FBNL) paradigm. It uses an encoder–decoder architecture where:
- The encoder produces reusable passage representations \(H_p \in \mathbb{R}^{n \times d}\) (Eq. 1).
- The decoder takes the system instruction, task instruction \(I\), and query \(q\), and attends to \(H_p\) via merged self-attention and cross-attention (Eqs. 2–4). Relevance is scored as the softmax-normalized likelihood of "yes" vs. "no" (Eq. 5).
This design preserves fine-grained interaction (unlike late interaction) while supporting offline passage encoding (unlike joint encoders).
## Methodology
### Model Architecture
Built on T5Gemma2 foundation models, KaLM-Reranker-V1 comes in three sizes:
| Model | Activated Params | Non-Embedding Params | Embedding Params | #Layers | Sequence Length | Document Token Dim. | MEP Support |
|---|---|---|---|---|---|---|---|
| Nano | 0.27B | 100M | 168M | 18 | 128K | 640 | 1×–32× |
| Small | 1B | 698M | 302M | 26 | 128K | 1152 | 1×–32× |
| Large | 4B | 3209M | 675M | 34 | 128K | 2560 | 1×–32× |

**Table 1:** KaLM-Reranker-V1 series specifications. MEP = Matryoshka Embedding Pooling.
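
As a rough illustration of what the Document Token Dim. column and the MEP ratio imply for storage, the sketch below estimates the cached size of one passage as \(\lceil n/r \rceil \times d\) values. The function name `cached_bytes`, the 512-token passage length, and the 2 bytes per value (fp16) are illustrative assumptions, not figures from the paper.

```python
import math

# Document token dimensions taken from Table 1.
DOC_TOKEN_DIM = {"Nano": 640, "Small": 1152, "Large": 2560}


def cached_bytes(n_tokens: int, dim: int, ratio: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache one passage after pooling by `ratio`.

    Assumes ceil(n / r) pooled vectors of width `dim`, each value stored
    in `bytes_per_value` bytes (2 = fp16, which is an assumption).
    """
    pooled_len = math.ceil(n_tokens / ratio)
    return pooled_len * dim * bytes_per_value


# Hypothetical 512-token passage at three compression ratios.
for name, dim in DOC_TOKEN_DIM.items():
    sizes = [cached_bytes(512, dim, r) for r in (1, 4, 32)]
    print(name, [f"{s / 1024:.0f} KiB" for s in sizes])
```

Multiplying the per-passage figure by the corpus size gives the total cache footprint, which is why the compression ratio matters most for large collections.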
**Encoder–Decoder Computation.** Given document \(p\), the encoder produces the passage representation \(H_p \in \mathbb{R}^{n \times d}\) (Eq. 1).

The decoder receives instruction \(I\) and query \(q\). With decoder input \(X \in \mathbb{R}^{m \times d}\), the merged attention uses:

$$\begin{aligned}
Q &= XW^Q, \\
K &= [X; H_p] W^K, \\
V &= [X; H_p] W^V,
\end{aligned} \tag{3}$$

and the output is:

$$O = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_h}} + M \right) V W^O, \tag{4}$$

where \(M\) masks invalid positions. The relevance score is:

$$\text{score}(q,p) = \frac{\exp(z_{\text{yes}})}{\exp(z_{\text{yes}}) + \exp(z_{\text{no}})} \tag{5}$$
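
The merged attention and the yes/no score can be sketched in NumPy as follows. This is a minimal single-head illustration, not the model's actual implementation: the head dimension equal to \(d\), the random weights, and the specific mask (causal over the decoder's own tokens, fully visible over the passage) are all assumptions made for the example, since the text only says that \(M\) masks invalid positions.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def merged_attention(X, H_p, W_q, W_k, W_v, W_o):
    """Single-head merged attention following Eqs. 3-4.

    X:   (m, d) decoder states for instruction + query
    H_p: (n, d) cached passage representation from the encoder
    """
    m, n = X.shape[0], H_p.shape[0]
    XH = np.concatenate([X, H_p], axis=0)      # [X; H_p], shape (m + n, d)
    Q, K, V = X @ W_q, XH @ W_k, XH @ W_v
    d_h = Q.shape[-1]

    # Assumed mask M: causal over decoder tokens, no masking over the passage.
    M = np.zeros((m, m + n))
    M[:, :m] = np.triu(np.full((m, m), -np.inf), k=1)

    A = softmax(Q @ K.T / np.sqrt(d_h) + M)    # (m, m + n) attention weights
    return A @ V @ W_o                         # (m, d)


def relevance_score(z_yes: float, z_no: float) -> float:
    """Eq. 5: two-way softmax over the 'yes' and 'no' logits."""
    mx = max(z_yes, z_no)
    e_yes, e_no = np.exp(z_yes - mx), np.exp(z_no - mx)
    return float(e_yes / (e_yes + e_no))


rng = np.random.default_rng(0)
m, n, d = 4, 6, 8                              # toy sizes
X, H_p = rng.normal(size=(m, d)), rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
O = merged_attention(X, H_p, W_q, W_k, W_v, W_o)
print(O.shape)                                 # (4, 8)
print(relevance_score(2.0, 0.0))               # equals sigmoid(2.0 - 0.0)
```

Because `H_p` enters only through the keys and values, it can be computed once offline and reused for every query, which is the property the FBNL design relies on.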

**Prompt Template.** The encoder takes only the document; the decoder receives system instruction + user instruction + query. The model predicts "yes"/"no" as the next token (Table 2).

### Training Objective

**Supervised Fine-Tuning (SFT).** For a query–document pair with binary label \(l \in \{\text{yes},\text{no}\}\), the loss is:

$$\mathcal{L}_{\text{sft}}(I,q,p,l) = -\log\frac{\exp(z_l)}{\exp(z_{\text{yes}})+\exp(z_{\text{no}})} \tag{6}$$

**Matryoshka Embedding Pooling (MEP).** To reduce storage, MEP compresses passage representations along the sequence dimension by ratio \(r\):

$$H_p^{(r)}[j] = \text{MeanPool}\left( H_p[(j-1)r+1 : \min(jr, n)] \right), \quad j = 1,\dots,\lceil n/r \rceil \tag{7}$$
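
A minimal NumPy sketch of the pooling in Eq. 7 is shown below. The function name `matryoshka_pool` and the toy input are illustrative assumptions. The code uses 0-based indexing, so group \(j\) of the equation corresponds to rows \((j-1)r\) through \(\min(jr, n) - 1\).

```python
import numpy as np


def matryoshka_pool(H_p: np.ndarray, r: int) -> np.ndarray:
    """Mean-pool consecutive groups of r token vectors (Eq. 7).

    H_p: (n, d) passage representation. Returns (ceil(n / r), d).
    The last group may hold fewer than r tokens and is averaged over
    the tokens it actually contains.
    """
    if r < 1:
        raise ValueError("compression ratio r must be >= 1")
    n = H_p.shape[0]
    starts = np.arange(0, n, r)                  # 0-based start of each group
    sums = np.add.reduceat(H_p, starts, axis=0)  # per-group sums
    counts = np.minimum(starts + r, n) - starts  # group sizes (last may be short)
    return sums / counts[:, None]


H_p = np.arange(10, dtype=float).reshape(5, 2)   # toy input: n = 5 tokens, d = 2
for r in (1, 2, 4):
    print(r, matryoshka_pool(H_p, r).shape)
```

With \(r = 1\) the representation is returned unchanged, so the uncompressed case is simply the first level of the same scheme.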

The final SFT objective over a set of ratios \(R\) is:

$$\mathcal{L}_{\text{sft}}(I,q,p,l,R) = \sum_{r \in R} \lambda_r \left( -\log\frac{\exp(z_l^{(r)})}{\exp(z_{\text{yes}}^{(r)})+\exp(z_{\text{no}}^{(r)})} \right) \tag{9}$$

By default \(\lambda_r = 1\) and \(R = \{1,2,4,8,16,32\}\).

**Knowledge Distillation (KD).** For student score \(\hat{y}\) and teacher soft label \(y \in [0,1]\):

$$\mathcal{L}_{\text{kd}} = -y\log\hat{y} - (1-y)\log(1-\hat{y}) \tag{10}$$
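
The two objectives can be written out in plain Python as a sanity check on Eqs. 6, 9, and 10. This is a sketch only: the helper names, the made-up logits, and the clamping constant `eps` are assumptions for illustration, and a real training loop would operate on batched tensors.

```python
import math
from typing import Dict, Optional, Tuple


def log_prob(z_yes: float, z_no: float, label: str) -> float:
    """Log of the two-way softmax probability of `label` ('yes' or 'no')."""
    mx = max(z_yes, z_no)
    log_z = mx + math.log(math.exp(z_yes - mx) + math.exp(z_no - mx))
    return (z_yes if label == "yes" else z_no) - log_z


def sft_loss(
    logits_by_ratio: Dict[int, Tuple[float, float]],
    label: str,
    weights: Optional[Dict[int, float]] = None,
) -> float:
    """Eq. 9: weighted sum over ratios r of the per-ratio Eq. 6 loss.

    logits_by_ratio maps r -> (z_yes, z_no) obtained with the passage
    pooled at ratio r. Missing weights default to lambda_r = 1.
    """
    weights = weights or {}
    return sum(
        weights.get(r, 1.0) * -log_prob(z_yes, z_no, label)
        for r, (z_yes, z_no) in logits_by_ratio.items()
    )


def kd_loss(y_hat: float, y: float, eps: float = 1e-12) -> float:
    """Eq. 10: binary cross-entropy against the teacher's soft label."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clamp to avoid log(0); eps is assumed
    return -y * math.log(y_hat) - (1.0 - y) * math.log(1.0 - y_hat)


logits = {1: (2.0, -1.0), 4: (1.5, -0.5), 32: (0.2, 0.1)}  # made-up logits
print(sft_loss(logits, "yes"))
print(kd_loss(y_hat=0.8, y=0.9))
```

When the teacher label is exactly 1, `kd_loss` reduces to the negative log-probability of "yes", so the distillation loss generalizes the binary SFT loss to soft targets.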

### Multi-stage Training

Three progressive stages:

1. **General Reranking Ability Learning:** No task instructions; domain-agnostic binary relevance training.
2. **Task-Specific Reranking Adaptation:** Introduce task-specific instructions; higher-quality data.
3. **Fine-Grained Relevance Distillation:** Soft labels from a stronger teacher (KaLM-Reranker-V1-Large for Small & Nano).

### Complexity Analysis

**Time complexity** (Table 3):

| Model Type | Interaction Pattern | Time Complexity |
|---|---|---|
| Encoder-based reranker | Bidirectional attention | \(O\left(KL\left((\lvert q\rvert+n)^2 d + (\lvert q\rvert+n)d^2\right)\right)\) |
| Decoder-based reranker | Causal attention | Same as above |
| **Enc-Dec (ours)** | Merged self-attn. + cross-attn. | \(O\left(\frac{L}{2}K\left(\lvert q\rvert\left(\lvert q\rvert+\lceil n/r\rceil\right)d + (\lvert q\rvert+\lceil n/r\rceil)d^2\right)\right)\) |

**Space complexity:** Without MEP, caching \(N\) passages costs \(O(Nnd)\). With MEP ratio \(r\), it reduces to \(O(N \lceil n/r \rceil d)\).

**Efficiency gains (Figure 4):**

- At \(n=4096\) (long passages): **203.4×** speedup over traditional rerankers.
- At compression ratio \(r=8\): **33.3×** speedup; at \(r=4\): **18.5×**; even at \(r=2\): ~10×.

## Empirical Validation / Results

### Benchmarks and Setup

- **BEIR:** 13 heterogeneous English retrieval tasks (nDCG@10).
- **MIRACL:** 18 languages.
- **LMEB:** 6 dialogue memory retrieval tasks (see Appendix).
- First-stage retriever: KaLM-Embedding-V2.5 (top-100 candidates).
- Default MEP ratio \(r=4\).

### Main Results on BEIR

| Model | Size | Cost (norm.) | Avg. nDCG@10 |
|---|---|---|---|
| Qwen3-Reranker-8B | 8B | 539.7× | 65.11 |
| Qwen3-Reranker-4B | 4B | 236.8× | 63.50 |
| **KaLM-Reranker-V1-L** | **4B** | **43.7×** | **62.87** |
| KaLM-Reranker-V1-S | 1B | 6.9× | 60.01 |
| KaLM-Reranker-V1-N | 0.27B | 1.0× | 57.41 |
| bge-reranker-large | 0.6B | 36.3× | 51.86 |
| gte-reranker-base | 0.3B | 11.9× | 56.77 |
| Qwen3-Reranker-0.6B | 0.6B | 42.4× | 59.36 |

**Table 4 (excerpt):** BEIR results. Cost is normalized to Nano as 1.0. KaLM-Reranker-V1 achieves strong performance at much lower cost.

Key observations:

- KaLM-Reranker-V1-Large is on par with Qwen3-Reranker-4B but at ~5.5× lower cost.
- KaLM-Reranker-V1-Small outperforms Qwen3-Reranker-0.6B at 6.9× vs. 42.4× cost.
- Nano (0.27B) slightly outperforms gte-reranker-base (0.3B) at ~12× lower cost.

### Main Results on MIRACL (18 languages)

| Model | Size | Cost | Avg. nDCG@10 |
|---|---|---|---|
| **KaLM-Reranker-V1-L** | **4B** | **43.7×** | **70.07** |
| bge-reranker-v2-gemma | 2.5B | 81.3× | 69.82 |
| KaLM-Reranker-V1-S | 1B | 6.9× | 66.89 |
| KaLM-Reranker-V1-N | 0.27B | 1.0× | 62.08 |
| mxbai-rerank-large-v2 | 1.5B | 79.2× | 63.26 |
| bge-reranker-large | 0.6B | 36.3× | 52.52 |

**Table 5 (excerpt):** MIRACL results. Despite limited multilingual training, KaLM-Reranker-V1-Large outperforms bge-reranker-v2-gemma at nearly 2× efficiency.

### Matryoshka Embedding Pooling Analysis (Figure 5)

- Performance decreases gradually from \(r=2\) to \(r=16\); the drop from \(r=16\) to \(r=32\) is more severe.
- Larger models are more robust: e.g., on BEIR, Large degrades less than Small and Nano.
- Moderate compression (\(r=2\)–\(8\)) retains most effectiveness while gaining substantial efficiency.

### ROC-AUC Analysis (Figure 6)

- As \(r\) increases, AUC decreases; smaller models suffer larger drops.
- For FQA (BEIR), Nano AUC drops from 0.871 (\(r=2\)) to 0.832 (\(r=32\)), while Large drops only from 0.952 to 0.948.
- Compression divides into two regimes: \(r=2,4,8\) are stable; \(r=16,32\) cause noticeable degradation.

## Theoretical and Practical Implications

- **Paradigm shift in reranking:** The FBNL design breaks the traditional efficiency–effectiveness trade-off. By decoupling passage encoding from query–passage interaction, it enables offline pre-computation of passage representations (like bi-encoders) while preserving rich cross-attention (like joint encoders). This is a novel middle ground between late-interaction and full-interaction methods.
- **Practical scalability:** For large-scale deployments (e.g., RAG with millions of documents), the ability to pre-encode and cache compressed passages yields massive savings in online compute and latency. The 203× efficiency gain for long passages is critical for real-world search systems.
- **Flexible storage–efficiency trade-off:** MEP allows practitioners to choose compression ratios based on storage budget. With \(r=4\) (default), performance loss is minimal (<1% on average), making it suitable for most applications. Training with multiple ratios enables a single model to serve at any compression level without retraining.
- **Multi-stage training framework:** The progressive three-stage pipeline (general → task-specific → distillation) is a reusable recipe that improves both generalization and fine-grained discrimination. This can be applied to other reranker architectures.
- **Implication for multilingual reranking:** Despite limited multilingual data, KaLM-Reranker-V1 shows strong cross-lingual transfer, suggesting that the encoder–decoder architecture with cross-attention may better handle language-agnostic relevance signals. However, Chinese performance remains a bottleneck, indicating the need for better foundation models or more Chinese training data.

## Conclusion

- **Main contributions:**
  1. A new FBNL reranking paradigm that decouples query and passage computation while preserving expressive cross-attention-based relevance modeling.
  2. Matryoshka Embedding Pooling (MEP) for flexible compression of passage representations (1×–32×).
  3. Three model sizes (0.27B, 1B, 4B) achieving state-of-the-art efficiency–effectiveness trade-offs on BEIR, MIRACL, and LMEB.
- **Key findings:**
  - KaLM-Reranker-V1 matches or outperforms strong industrial rerankers (Qwen3, BGE, Jina) at substantially lower online cost (up to 203× for long passages).
  - Moderate MEP compression (\(r=2\)–\(8\)) preserves performance; larger models are more robust to compression.
  - ROC analysis confirms that discriminative ability is maintained under moderate compression.
- **Limitations and future work:**
  - The current compression method (mean pooling) may lose informative signals at high ratios (\(r=16,32\)). Future work could explore more effective compression methods (e.g., learned pooling, attention-based aggregation).
  - Multilingual and Chinese reranking capabilities are limited by the foundation model; more extensive multilingual training data and better backbone models are needed.
  - The current training data size (3.7M samples) is smaller than that of some competitors (e.g., Qwen3 uses 19M); scaling up data could further improve performance.