Effective Distillation to Hybrid xLSTM Architectures

Summary (Overview)

  • Goal: Achieve "lossless distillation" of quadratic attention-based LLMs into sub-quadratic xLSTM-based architectures, defined by a high tolerance-corrected Win-and-Tie rate ($C_\alpha$) across diverse tasks.
  • Key Method: Introduces a distillation pipeline featuring a hybrid mLSTM-SWA architecture and an optional expert merging stage. The hybrid combines a global mLSTM (for long-range dependencies) with local Sliding Window Attention (SWA) and sink tokens, gated dynamically.
  • Main Results: Distilled xLSTM students (from Llama, Qwen, Olmo families) recover most teacher performance, often exceeding it on specific downstream tasks (e.g., code generation). They achieve significantly higher $C_\alpha$ and lower critical tolerance $\alpha^*$ than prior linearization methods (LoLCATs, RADLADS, Mamba-in-Llama).
  • Inference Efficiency: The xLSTM-based students demonstrate substantial inference advantages: ~2x higher prefill throughput, ~2x reduction in time-to-first-token, ~4x higher generation throughput for long contexts, and constant memory usage during decoding.
  • Modular Capability Development: Demonstrates that weight-space merging (Eq. 14) of independently distilled domain experts (math, code, STEM, chat) into a single model is effective, enabling decentralized and modular linearization.

Introduction and Theoretical Foundation

Current Transformer-based LLMs are computationally expensive due to their quadratic attention mechanisms. Distillation into sub-quadratic architectures aims to create efficient drop-in replacements, but prior methods often fail to match teacher performance on harder generative tasks (math, code reasoning).

The paper formalizes the goal of lossless distillation via the Win-and-Tie rate $C_\alpha$, defined as the fraction of benchmarks where the student matches or exceeds teacher performance within a tolerance $\alpha$. The critical tolerance $\alpha^*$ is the minimum $\alpha$ such that $C_\alpha \geq 0.5$. Lower $\alpha^*$ indicates a better, more reliable student.

xLSTM (Beck et al., 2024) is identified as a powerful linear-complexity alternative. The proposed method hybridizes xLSTM's mLSTM cell with sparse Sliding Window Attention (SWA) and sink tokens using learned gates, conceptually blending quadratic KV memory with linear fast-weight memory.

Methodology

Architecture & Student Initialization

The student architecture mirrors the teacher (a pre-trained causal Transformer) but replaces each multi-head attention block with a hybrid of SWA and mLSTM.

Hybrid Output Computation: The final output $\hat{h}_t$ combines the global mLSTM and local SWA+sink outputs via a data-dependent, per-head scalar output gate $o_t$:

$$\hat{h}_t = o_t \,\text{mLSTM}(q_t) + (1 - o_t)\,\text{SWA}(q_t) = o_t\, \frac{\phi(q_t)\, S_t}{\phi(q_t)\, z_t} + (1 - o_t)\,\text{sm}\!\left(\frac{q_t\, (K^W_t)^\top}{\sqrt{d_{qk}}}\right) V^W_t$$

where $\text{sm}$ denotes softmax.
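A minimal NumPy sketch of this gated combination for a single head may clarify the data flow. All shapes and the softmax feature map are illustrative assumptions, not the paper's exact configuration, and a real implementation would stabilize the normalizer division:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_output(q_t, S_t, z_t, K_win, V_win, o_t, phi=softmax):
    """Gated blend of the global mLSTM readout and local SWA (single head).

    q_t:   (d_qk,)      query at step t
    S_t:   (d_qk, d_v)  mLSTM matrix (fast-weight) memory state
    z_t:   (d_qk,)      mLSTM normalizer state
    K_win: (W, d_qk)    keys in the local window (sinks included)
    V_win: (W, d_v)     values in the local window
    o_t:   scalar gate in (0, 1); data-dependent in the real model
    phi:   feature map on queries (softmax over features, per the text)
    """
    fq = phi(q_t)
    mlstm_out = (fq @ S_t) / (fq @ z_t)                       # global linear memory
    attn = softmax((q_t @ K_win.T) / np.sqrt(q_t.shape[0]))   # local softmax attention
    swa_out = attn @ V_win
    return o_t * mlstm_out + (1.0 - o_t) * swa_out
```

The output is linear in the gate $o_t$, so the model can smoothly trade off global linear memory against local exact attention per head and per step.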

Key mLSTM Adaptations:

  • Uses the original normalizer design (Eq. 10) without added normalization layers.
  • Uses per-head scalar output gates instead of per-channel gates.
  • The input to the output-gate projection is the concatenation of the head inputs $[q_t\, k_t\, v_t]$.
  • Query/key inputs to the mLSTM use head-wise feature maps $\phi$ with softmax over features.

SWA & Sinks: SWA uses a fixed window of 512 tokens plus 4 initial sink tokens per sequence.
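The combined SWA-plus-sinks attention pattern can be sketched as a boolean mask; the function below is one plausible reading of "window of 512 plus 4 initial sink tokens", not the paper's code:

```python
import numpy as np

def swa_sink_mask(T, window=512, n_sinks=4):
    """Causal attention mask where position t sees the previous `window`
    tokens plus the first `n_sinks` tokens of the sequence."""
    idx = np.arange(T)
    causal = idx[None, :] <= idx[:, None]                # no attending to the future
    in_window = (idx[:, None] - idx[None, :]) < window   # local sliding window
    is_sink = idx[None, :] < n_sinks                     # sinks are always visible
    return causal & (in_window | is_sink)
```

Each row has at most `window + n_sinks` True entries, so the SWA branch stays sparse regardless of sequence length.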

Linearization Fine-Tuning Pipeline

Stage I: Layer-wise Hidden-State Alignment Align the student's per-layer representations to the teacher's attention outputs using an MSE loss. Teacher embedding and MLP weights are frozen. For layer $\ell$ and step $t$:

$$\min_{\theta_\ell} \left\| h^{(\ell)}_t - \hat{h}^{(\ell)}_t \right\|_2^2$$

where $\theta_\ell$ are the newly introduced parameters (feature maps, gate projections).
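In code, the Stage I objective at one layer reduces to a plain MSE over precomputed teacher activations (a sketch; variable names are ours):

```python
import numpy as np

def stage1_alignment_loss(student_out, teacher_out):
    """Stage I: MSE between the student hybrid block's output and the frozen
    teacher's attention output at the same layer and positions. Only the new
    parameters (feature maps, gate projections) would receive gradients."""
    return float(np.mean((student_out - teacher_out) ** 2))
```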

Stage II: Sparse Knowledge Distillation Unfreeze all student parameters $\theta$ and fine-tune end-to-end with a mixed objective:

$$\min_{\theta} \left\{ -\sum_{t=1}^{T} \gamma \log p_\theta(y_t \mid x_{1:t}) + \beta\, \mathrm{KL}\!\left[ p^{(k)}_T(\cdot \mid x_{1:t}) \,\middle\|\, p^{(k)}_\theta(\cdot \mid x_{1:t}) \right] \right\}$$
  • $\gamma = 0.9$, $\beta = 0.1$ for the cross-entropy (CE) and sparse KL divergence (top-$k = 256$ tokens) terms, respectively.
  • Sparse KL allows precomputing teacher targets, avoiding online teacher queries during long-context distillation.
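The mixed Stage II loss with precomputed top-$k$ teacher targets can be sketched as follows; renormalizing the teacher probabilities over the top-$k$, and all shapes, are our assumptions:

```python
import numpy as np

def sparse_kd_loss(student_logits, teacher_topk_idx, teacher_topk_logprob,
                   targets, gamma=0.9, beta=0.1):
    """Mixed objective: gamma * CE on ground-truth tokens plus beta * sparse
    KL against the teacher's precomputed top-k distribution.

    student_logits:       (T, V)
    teacher_topk_idx:     (T, k) token ids of the teacher's top-k
    teacher_topk_logprob: (T, k) teacher log-probs renormalized over the top-k
    targets:              (T,)   ground-truth next-token ids
    """
    m = student_logits.max(axis=-1, keepdims=True)
    log_z = m + np.log(np.exp(student_logits - m).sum(axis=-1, keepdims=True))
    log_p = student_logits - log_z                                  # (T, V)
    ce = -log_p[np.arange(len(targets)), targets].mean()
    s_topk = np.take_along_axis(log_p, teacher_topk_idx, axis=-1)   # (T, k)
    p_teacher = np.exp(teacher_topk_logprob)
    kl = (p_teacher * (teacher_topk_logprob - s_topk)).sum(axis=-1).mean()
    return gamma * ce + beta * kl
```

Because `teacher_topk_idx` and `teacher_topk_logprob` are fixed arrays, they can be dumped to disk once, which is what removes the online teacher forward pass from long-context distillation.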

Stage III (Optional): Expert Merging Train $K$ domain experts $\{\theta^{(i)}\}_{i=1}^{K}$ independently from the same initialized seed $\theta^{(0)}$. Merge into a single student via linear weight merging:

$$\theta_{\text{merge}} = \sum_{i=1}^{K} \lambda_i\, \theta^{(i)}, \qquad \lambda_i \geq 0, \qquad \sum_{i=1}^{K} \lambda_i = 1$$

Default: uniform weights $\lambda_i = 1/K$. Enables capability patching.
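The merge itself is only a few lines; the dict-of-arrays state representation below is illustrative:

```python
import numpy as np

def merge_experts(expert_states, lambdas=None):
    """Linear weight-space merge of K experts distilled from a shared init.

    expert_states: list of {param_name: array} checkpoints
    lambdas:       simplex weights (uniform 1/K by default)
    """
    K = len(expert_states)
    lambdas = np.full(K, 1.0 / K) if lambdas is None else np.asarray(lambdas, float)
    assert np.isclose(lambdas.sum(), 1.0) and (lambdas >= 0).all()
    return {name: sum(l * e[name] for l, e in zip(lambdas, expert_states))
            for name in expert_states[0]}
```

Starting all experts from the same seed $\theta^{(0)}$ is what keeps their weights in a shared basin so that this plain averaging works.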

Empirical Validation / Results

Evaluation Metrics

  • Teacher-Recovery Rate: Ratio of student to teacher performance on a benchmark; values $> 1$ indicate the student exceeds the teacher.
  • Win-and-Tie Rate ($C_\alpha$): Fraction of benchmarks where the student matches or exceeds the teacher within tolerance $\alpha$.
  • Critical Tolerance ($\alpha^*$): Minimum $\alpha$ such that $C_\alpha \geq 0.5$.
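These metrics reduce to a few lines of Python; note that the exact tolerance convention (absolute, as assumed here, versus relative) is our reading:

```python
def recovery_rate(student_score, teacher_score):
    """Teacher-recovery rate; > 1 means the student exceeds the teacher."""
    return student_score / teacher_score

def win_and_tie_rate(student, teacher, alpha):
    """C_alpha: fraction of benchmarks where the student is within an
    (assumed absolute) tolerance alpha of the teacher, or better."""
    return sum(s >= t - alpha for s, t in zip(student, teacher)) / len(student)

def critical_tolerance(student, teacher, grid=None):
    """alpha*: smallest alpha on a search grid with C_alpha >= 0.5."""
    for a in grid or [i / 100 for i in range(101)]:
        if win_and_tie_rate(student, teacher, a) >= 0.5:
            return a
    return 1.0
```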

Base Model Evaluation (Llama3.1-8B, Olmo3-7B)

Language Understanding Tasks (MMLU, HellaSwag, etc.):

  • xLSTM students achieve full or near-full teacher parity.
  • Prior methods (LoLCATs, QRWKV6-7B) show significant gaps.

Language Generation & Reasoning Tasks (GSM8K, HumanEval, etc.):

  • Prior methods exhibit large performance gaps ($\alpha^* = 1.0$).
  • xLSTM hybrids achieve strong recovery: $\alpha^* = 0.0$ for Llama3.1-8B and $\alpha^* = 0.01$ for Olmo3-7B.

Key Result Table (Recovery Rates - Base Models):

| Model (Teacher) | PIQA | ARC-e | ARC-c | HellaSwag | Winogrande | MMLU | GSM8K | HumanEval | MBPP |
|---|---|---|---|---|---|---|---|---|---|
| xLSTM-Llama3.1-8B | 1.02 | 1.00 | 0.97 | 1.00 | 1.03 | 1.00 | 1.67 | 1.14 | 1.19 |
| LoLCATs | 0.99 | 0.92 | 0.96 | 0.80 | 1.01 | 0.95 | 0.17 | 0.08 | 0.06 |
| xLSTM-Olmo3-7B | 1.00 | 0.99 | 0.97 | 1.00 | 0.99 | 0.99 | 1.10 | 0.80 | 0.88 |
| QRWKV6-7B | 0.99 | 1.00 | 0.97 | 0.87 | 1.00 | 0.97 | 0.30 | 0.43 | 0.29 |

Instruction-Tuned Model Evaluation (Llama3.1-8B-IT, Qwen2.5-7B-IT)

Decentralized Linearization: Four domain experts (math, STEM, code, instruction/chat) distilled independently, then merged.

Results vs. Baselines:

  • xLSTM-Llama3.1-8B-IT vs. Mamba-in-Llama: xLSTM student matches/exceeds teacher on many tasks (e.g., MATH 500: 1.05 recovery), while baseline shows large deficits (e.g., GSM8K: 0.71 recovery).
  • xLSTM-Qwen2.5-7B-IT vs. QRWKV7-7B-IT: xLSTM student shows strong recovery, especially in math (MATH: 0.89) and code (HumanEval+: 1.03), outperforming baseline.
  • Win-and-Tie Rates: xLSTM students achieve $\alpha^* = 0.02$ (Llama) and $\alpha^* = 0.05$ (Qwen), indicating near-lossless distillation.

Effect of Merging: Merging improves overall capability coverage, especially instruction-following (IFEval). Some interference observed on STEM tasks (GPQA). Math and code capabilities remain robust.

Ablations

  • Components: Pure mLSTM outperforms pure linear attention. mLSTM + SWA + Sinks combination yields best performance.
  • Distillation Objective: The mixed objective ($\gamma = 0.9$, $\beta = 0.1$) outperforms pure KL distillation.
  • Fine-tuning Method: Full Fine-Tuning (FFT) significantly outperforms Parameter-Efficient Fine-Tuning (PEFT/LoRA).

Inference Comparison

Prefill (Prompt Encoding):

  • The student has ~2x higher throughput at batch size $B = 1$ and context length $C = 65\text{K}$.
  • ~2x reduction in Time-To-First-Token (TTFT).

Generation (Autoregressive Decoding):

  • The student halves latency and GPU memory usage at a generation budget of $G = 131\text{K}$ tokens ($B = 1$).
  • Student maintains constant memory over time; teacher's memory grows.
  • With prefill and $B = 8$, the student achieves up to ~4x higher generation throughput as context length increases.
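The constant-memory claim follows from simple arithmetic: a Transformer's KV cache grows linearly with sequence length, while the mLSTM's matrix state does not. A back-of-the-envelope sketch, with illustrative model dimensions rather than the paper's exact configurations:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    """Transformer decoding memory: K and V caches grow linearly in seq_len."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

def mlstm_state_bytes(n_layers, n_heads, d_qk, d_v, bytes_per_elem=2):
    """mLSTM decoding memory: a fixed (d_qk x d_v) matrix state plus a
    d_qk normalizer per head, independent of sequence length."""
    return n_layers * n_heads * (d_qk * d_v + d_qk) * bytes_per_elem
```

Doubling the context doubles the KV cache, while the mLSTM state stays fixed, which matches the reported constant decode memory.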

Theoretical and Practical Implications

  • Formalized Evaluation: Introduces $C_\alpha$ and $\alpha^*$ as rigorous metrics for assessing "lossless distillation" and the reliability of drop-in replacements.
  • Effective Distillation Pipeline: Provides a recipe (hybrid architecture, two-stage fine-tuning, optional merging) that successfully transfers capabilities from quadratic to sub-quadratic models.
  • Modularity & Efficiency: The expert merging stage enables decentralized, parallel development of domain-specific efficient models, which can be consolidated into a single deployable model. This supports targeted updates and capability patching.
  • Inference Advantages: The xLSTM-based hybrid offers substantial improvements in latency, throughput, and memory consumption, making it a compelling candidate for efficient deployment.
  • Architectural Contribution: Demonstrates the effectiveness of hybridizing mLSTM (global, linear) with SWA+sinks (local, sparse) via data-dependent gating, capturing both short and long-term dependencies.

Conclusion

The proposed distillation pipeline successfully creates xLSTM-based hybrid students that recover most teacher performance across diverse benchmarks, as formalized by high Win-and-Tie rates $C_\alpha$. The method outperforms prior linearization approaches and demonstrates strong inference efficiency benefits.

Key Takeaways:

  1. Lossless distillation to sub-quadratic architectures is achievable with the proposed hybrid mLSTM-SWA design and fine-tuning pipeline.
  2. Weight-space merging remains effective after linearization, enabling modular capability development.
  3. The distilled xLSTM models are prime candidates for drop-in replacement of Transformer-based LLMs when inference efficiency is critical.

Limitations & Future Work: Remaining gaps on synthetic long-context evaluations and some reasoning benchmarks; interference between merged experts. Future directions include scaling to larger teachers (e.g., MoE models), exploring stronger attention hybrids for long contexts, and studying on-policy distillation or RL-based expert refinement before merging.