Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models - Summary

Summary (Overview)

  • Pioneering Framework: Introduces TIDE, the first unified framework for cross-architecture knowledge distillation for Diffusion Large Language Models (dLLMs), addressing heterogeneous transfer between models with different architectures, attention mechanisms, and tokenizers.
  • Three Modular Components: Proposes three novel components to overcome specific challenges:
    1. TIDAL: A dual-axis scheduler modulating distillation strength based on both training progress and diffusion timestep to account for the teacher's noise-dependent reliability.
    2. COMPDEMO: Enriches teacher context via complementary mask splitting, providing better predictions under heavy masking.
    3. Reverse CALM: A cross-tokenizer objective using inverted chunk-level likelihood matching to yield bounded gradients and dual-end noise filtering.
  • Effective Distillation: A 0.6B student distilled from 16B MoE and 8B dense teachers outperforms the non-distilled baseline by an average of +1.53 points across eight benchmarks and gains +16.48 on HumanEval code generation over a same-sized autoregressive model.
  • Practical Efficiency: The distilled 0.6B student requires 22x less memory and runs 5x faster than the 16B teacher, enabling deployment on commodity hardware with minimal inference overhead.

Introduction and Theoretical Foundation

Diffusion Large Language Models (dLLMs) offer parallel decoding and bidirectional context as an alternative to dominant autoregressive (AR) models. However, state-of-the-art dLLMs require billions of parameters (e.g., 8B-100B), posing a significant deployment barrier. While knowledge distillation is well-established for AR models and existing dLLM distillation methods focus on step compression (reducing inference steps within the same architecture), cross-architecture distillation for dLLMs remains unexplored.

This setting introduces three fundamental challenges:

  1. Temporal Inconsistency: The teacher's reliability fluctuates drastically across the diffusion process (timestep-dependent).
  2. Context Scarcity: Severe masking at high noise levels reduces available context, making teacher predictions uninformative.
  3. Vocabulary Misalignment: Distinct tokenizer vocabularies render standard token-level likelihood objectives inapplicable.

TIDE is proposed as a unified framework to overcome these temporal, spatial, and vocabulary barriers through three synergistic components that orchestrate an end-to-end learning pipeline.

Methodology

The TIDE framework distills a large teacher dLLM $f_T$ (parameters $\theta_T$) into a smaller student $f_S$ (parameters $\theta_S$), where the two may differ in architecture, attention, and tokenizer. Let $x = (x_1, \ldots, x_L)$ be a clean token sequence and $x_t$ its noised version at diffusion timestep $t \in [\epsilon, 1)$, with masked positions $M$ replaced by [MASK].
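
For concreteness, here is a minimal PyTorch sketch of this masking corruption, assuming each position is masked independently with probability $t$; the `MASK_ID` value and tensor shapes are illustrative placeholders, not values from the paper:

```python
import torch

MASK_ID = 0  # placeholder [MASK] token id; the actual id depends on the tokenizer

def corrupt(x: torch.Tensor, t: float):
    """Mask each token of a clean sequence x (batch, L) independently with probability t."""
    m = torch.rand(x.shape, device=x.device) < t  # M: the set of masked positions
    x_t = x.masked_fill(m, MASK_ID)               # noised sequence at timestep t
    return x_t, m
```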

1. Time-Iteration Dual-Axis Lambda Modulation (TIDAL)

This component dynamically modulates distillation strength along two axes to handle temporal inconsistency.

  • Axis 1: Diffusion Timestep: Modulates based on teacher reliability at noise level $t$: $\lambda_t = \lambda_{\text{train}} \times (1 - t)$. At high noise ($t \approx 1$), $\lambda_t \approx 0$, avoiding unreliable teacher signals. At low noise ($t \approx 0$), $\lambda_t \approx \lambda_{\text{train}}$, fully relying on the teacher.
  • Axis 2: Training Progress: The base coefficient $\lambda_{\text{train}}$ follows a cosine schedule over normalized progress $p \in [0, 1]$: $\lambda_{\text{train}} = \lambda_{\text{init}} + (\lambda_{\max} - \lambda_{\text{init}}) \times \frac{1}{2}(1 - \cos(\pi p))$, with defaults $\lambda_{\text{init}} = 0.1$ and $\lambda_{\max} = 0.9$. Early training is student-dominated to prevent collapse; later stages shift to teacher supervision.
  • Interpolated Target and Loss: Given student logits $s$ and teacher logits $t$ at masked positions, the interpolated target is $r_t = \text{softmax}\!\left(\frac{(1 - \lambda_t)\, s + \lambda_t\, t}{T}\right)$, and the TIDAL loss is $\mathcal{L}_{\text{TIDAL}} = D_{\mathrm{KL}}\!\left(r_t \,\|\, \text{softmax}(s / T)\right) \times T^2$, where $r_t$ is detached from the computation graph. An optional midrange timestep weighting $w(t) = \exp\!\left(-\frac{(t - 0.5)^2}{2\sigma^2}\right)$ with $\sigma = 0.15$ can be applied (see the sketch after this list).
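
A minimal PyTorch sketch of the dual-axis schedule and the TIDAL loss as described above; the temperature value `T = 2.0` and the tensor shapes are assumptions for illustration only:

```python
import math
import torch
import torch.nn.functional as F

def lambda_schedule(p: float, t: float, lam_init: float = 0.1, lam_max: float = 0.9) -> float:
    """Dual-axis coefficient: cosine ramp over training progress p, scaled by (1 - t)."""
    lam_train = lam_init + (lam_max - lam_init) * 0.5 * (1.0 - math.cos(math.pi * p))
    return lam_train * (1.0 - t)  # trust the teacher less at high noise levels

def tidal_loss(s: torch.Tensor, t_logits: torch.Tensor, p: float, t: float,
               T: float = 2.0, sigma: float = 0.15, midrange: bool = False) -> torch.Tensor:
    """TIDAL loss on student logits s and teacher logits t_logits at masked positions,
    both of shape (num_masked, vocab)."""
    lam_t = lambda_schedule(p, t)
    # Interpolated target r_t, detached so gradients flow only through the student term
    r_t = F.softmax(((1.0 - lam_t) * s + lam_t * t_logits) / T, dim=-1).detach()
    loss = F.kl_div(F.log_softmax(s / T, dim=-1), r_t, reduction="batchmean") * T ** 2
    if midrange:  # optional Gaussian emphasis on mid-range timesteps
        loss = loss * math.exp(-(t - 0.5) ** 2 / (2 * sigma ** 2))
    return loss
```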

2. Complementary Demonstration (COMPDEMO)

This component enriches teacher context to overcome spatial scarcity under heavy masking.

  • Mask Splitting: The masked set $M$ is randomly partitioned into two complementary subsets $M_A$ and $M_B$ such that $M_A \cup M_B = M$, $M_A \cap M_B = \emptyset$, and $|M_A| / |M| \approx \rho$, where the demonstration ratio $\rho = 0.5$.
  • Two-Pass Teacher Inference: Perform two forward passes through the frozen teacher (a minimal sketch follows this list):
    • Pass 1: $t^{(1)} = f_T(\text{reveal } M_A,\ \text{mask } M_B)$ → logits at $M_B$.
    • Pass 2: $t^{(2)} = f_T(\text{reveal } M_B,\ \text{mask } M_A)$ → logits at $M_A$. The merged final logits are $t_{\text{final}}[M_B] \leftarrow t^{(1)}[M_B]$ and $t_{\text{final}}[M_A] \leftarrow t^{(2)}[M_A]$.
  • This strategy doubles teacher forward passes but increases total training time by only ~50% as the teacher is frozen.
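
A hedged sketch of the two-pass teacher inference, treating the frozen teacher as a callable from token ids to logits of shape (batch, L, vocab); the exact model interface is an assumption:

```python
import torch

@torch.no_grad()
def compdemo_teacher_logits(teacher, x: torch.Tensor, x_t: torch.Tensor,
                            mask: torch.Tensor, rho: float = 0.5) -> torch.Tensor:
    """Split masked positions M into complementary halves M_A / M_B, run two teacher
    passes, and merge so every masked position is predicted with the other half revealed."""
    coin = torch.rand(mask.shape, device=mask.device) < rho
    mask_a = mask & coin        # M_A, roughly rho * |M| positions
    mask_b = mask & ~coin       # M_B = M \ M_A

    logits_1 = teacher(torch.where(mask_a, x, x_t))   # reveal M_A, keep M_B masked
    logits_2 = teacher(torch.where(mask_b, x, x_t))   # reveal M_B, keep M_A masked

    merged = logits_1.clone()                         # t_final[M_B] <- pass-1 logits
    merged[mask_a] = logits_2[mask_a]                 # t_final[M_A] <- pass-2 logits
    return merged
```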

3. Distillation Objectives & Reverse CALM

TIDE supports two pipelines with tailored objectives.

  • Shared-Tokenizer Objective (WeDLM → BD3LM): Teacher and student share a tokenizer. The loss combines Cross-Entropy (CE) and TIDAL:

    $\mathcal{L}_B = \mathcal{L}_{\text{CE}} + w_{\text{tidal}} \cdot \mathcal{L}_{\text{TIDAL}}$

    COMPDEMO can optionally be integrated.

  • Cross-Tokenizer Objective (LLaDA2 → BD3LM): Teacher and student have different vocabularies ($V_T \neq V_S$). Chunk-level Approximate Likelihood Matching (CALM) is introduced, aligning token sequences at the byte level to identify chunks.

    • Chunk-Level Log-Probabilities: For each token $x_i$, $\log P(x_i) = \text{logits}_{x_i} - \operatorname{logsumexp}(\text{logits})$. Chunk-level log-probabilities are obtained via alignment matrices $A_S$ and $A_T$: $\text{LP}_S = \text{lp}_S \cdot A_S \in \mathbb{R}^{b \times C}$ and $\text{LP}_T = \text{lp}_T \cdot A_T \in \mathbb{R}^{b \times C}$. Chunk probabilities: $p^c_s = \exp(\text{LP}^c_S / T)$ and $p^c_t = \exp(\text{LP}^c_T / T)$.
    • Forward CALM (Baseline): Applies Binary Cross-Entropy (BCE): $\mathcal{L}_{\text{Fwd-CALM}} = -\left[\, p^c_t \log p^c_s + (1 - p^c_t) \log(1 - p^c_s) \,\right]$. It can be integrated with TIDAL via $p_{\text{mix}} = (1 - \lambda_t) \cdot p^c_s + \lambda_t \cdot p^c_t$.
    • Reverse CALM (Proposed): To address gradient explosion in Forward CALM (driven by divergence of the ratio $p^c_t / p^c_s$), the BCE direction is reversed: $\mathcal{L}_{\text{Rev-CALM}} = -\left[\, p^c_s \log p^c_t + (1 - p^c_s) \log(1 - p^c_t) \,\right]$. This yields bounded gradients (the gradient coefficient $\log \frac{p^c_t}{1 - p^c_t}$ depends only on the fixed teacher) and provides dual-end noise filtering. It is equivalent to minimizing the Bernoulli KL divergence $\mathrm{KL}_{\text{Bern}}(p^c_s \,\|\, p^c_t)$.
    • The cross-tokenizer training objective is $\mathcal{L}_A = \mathcal{L}_{\text{CE}} + w_{\text{calm}} \cdot \mathcal{L}_{\text{dist}}$, where $\mathcal{L}_{\text{dist}} \in \{\mathcal{L}_{\text{CALM-TIDAL}}, \mathcal{L}_{\text{Rev-CALM}}\}$ (a sketch of Reverse CALM follows this list).
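
A minimal PyTorch sketch of the chunk-level aggregation and the Reverse CALM loss under the definitions above. The alignment matrices are assumed to be precomputed 0/1 token-to-chunk maps derived from the byte-level alignment, and the temperature and clamping epsilon are illustrative assumptions:

```python
import torch

def chunk_log_probs(lp: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """Aggregate token-level log-probs lp (b, L) into chunk-level log-probs (b, C)
    via a 0/1 alignment matrix A (L, C) mapping tokens to byte-aligned chunks."""
    return lp @ A

def reverse_calm_loss(lp_s: torch.Tensor, lp_t: torch.Tensor,
                      A_s: torch.Tensor, A_t: torch.Tensor,
                      T: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """Reverse CALM: BCE with the roles of student and teacher chunk probabilities
    swapped, so the per-chunk gradient coefficient depends only on the frozen teacher."""
    p_s = torch.exp(chunk_log_probs(lp_s, A_s) / T).clamp(eps, 1.0 - eps)
    p_t = torch.exp(chunk_log_probs(lp_t, A_t) / T).clamp(eps, 1.0 - eps).detach()
    # L_Rev-CALM = -[ p_s * log p_t + (1 - p_s) * log(1 - p_t) ], averaged over chunks
    loss = -(p_s * torch.log(p_t) + (1.0 - p_s) * torch.log(1.0 - p_t))
    return loss.mean()

# Full cross-tokenizer objective (sketch): L_A = L_CE + w_calm * L_dist,
# with L_dist being reverse_calm_loss above or a CALM-TIDAL variant.
```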

Empirical Validation / Results

Experimental Setup

  • Student Model: Qwen3-0.6B-BD3LM (0.6B block diffusion model).
  • Teacher Models:
    • Pipeline A (Cross-Tokenizer): LLaDA2.0-mini (16B MoE, independent tokenizer).
    • Pipeline B (Shared-Tokenizer): WeDLM-8B-Instruct (8B dense, shared Qwen tokenizer).
  • Training: 10 epochs, LR=5e-5, sequence length 512, on combined SFT datasets (Tulu-3, SmolTalk, OpenCoder).
  • Evaluation: 8 benchmarks: GSM8K, MATH, BBH, MMLU-Pro, HellaSwag, MMLU, HumanEval, MBPP.
  • Baselines: AR model (Qwen3-0.6B-Base) and non-distilled BD3LM from Zhou et al. (2026).

Main Results

Table 1: Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. Columns prefixed "Shared:" use the shared-tokenizer pipeline (WeDLM teacher); columns prefixed "Cross:" use the cross-tokenizer pipeline (LLaDA2 teacher).

| Benchmark | AR (Qwen3-0.6B) | BD3LM (No Distill) | Shared: KL | Shared: TIDE-Cross | Shared: TIDE-Shared | Cross: CALM | Cross: TIDE-Shared |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GSM8K | 59.60 | 45.56 | 43.97 | 45.03 | 48.98 | 48.60 | 49.89 |
| MATH | 32.40 | 13.08 | 9.40 | 9.76 | 11.16 | 13.14 | 12.98 |
| BBH | 41.50 | 26.32 | 25.79 | 26.00 | 26.79 | 24.21 | 26.85 |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 12.88 | 14.48 | 13.47 | 14.02 |
| HellaSwag | 47.40 | 39.28 | 39.78 | 39.50 | 40.50 | 40.42 | 39.57 |
| MMLU | 52.80 | 39.15 | 39.57 | 39.09 | 39.92 | 39.42 | 39.54 |
| HumanEval | 32.30 | 46.34 | 41.46 | 42.68 | 48.78 | 43.90 | 49.39 |
| MBPP | 36.60 | 37.80 | 31.20 | 31.40 | 37.80 | 34.80 | 38.40 |
| Avg | 40.91 | 32.67 | 30.55 | 30.79 | 33.55 | 32.25 | 33.83 |
  • Cross-Architecture Distillation Is Effective: Both TIDE pipelines outperform the non-distilled BD3LM baseline (Avg 32.67). The cross-tokenizer TIDE-Cross strategy achieves the highest average score (34.20), and shared-tokenizer TIDE-Shared reaches 33.55.
  • Each Pipeline Favors Its Native Strategy:
    • Cross-tokenizer pipeline prefers TIDE-Cross (Reverse CALM) by +0.37 avg, suited for alignment noise.
    • Shared-tokenizer pipeline prefers TIDE-Shared (TIDAL + COMPDEMO) by +2.76 avg, effective with exact token alignment.
  • Distilled dLLMs Excel at Code Generation: On HumanEval, TIDE-Shared (shared) scores 48.78 and TIDE-Cross (cross) scores 48.17, substantially exceeding the AR baseline (32.30). This suggests parallel diffusion decoding benefits structured output generation.

Ablation Studies

Ablations on the shared-tokenizer pipeline (TIDE-Shared strategy) isolate component contributions.

Table 2: Component-level ablation on the shared-tokenizer pipeline (WeDLM → Qwen3-BD3LM).

| Benchmark | Baseline (w/o train-progress axis) | w/o Timestep Axis | w/o COMPDEMO | Full (TIDAL + COMPDEMO) |
| --- | --- | --- | --- | --- |
| GSM8K | 48.07 | 48.82 | 48.90 | 48.90 |
| MATH | 11.74 | 11.96 | 11.84 | 11.68 |
| BBH | 26.37 | 26.51 | 26.77 | 26.66 |
| MMLU-Pro | 14.12 | 14.42 | 13.76 | 13.76 |
| HellaSwag | 40.03 | 40.35 | 40.16 | 40.27 |
| MMLU | 39.81 | 39.84 | 39.58 | 39.92 |
| HumanEval | 45.73 | 43.90 | 44.51 | 46.95 |
| MBPP | 38.60 | 37.20 | 38.20 | 37.00 |
| Avg | 33.06 | 32.88 | 32.97 | 33.14 |
  • Timestep Axis Is Most Impactful: Removing it causes the largest avg drop (-0.26), with a -3.05 drop on HumanEval, validating its necessity.
  • COMPDEMO Provides Consistent Gains: Removing it reduces avg by -0.17, with notable drops on HumanEval (-2.44) and MMLU (-0.34).
  • Full Framework Outperforms Baseline: The complete TIDE (dual-axis + COMPDEMO) outperforms the timestep-only baseline, stabilizing early training.

Inference Efficiency

Table 3: Inference efficiency comparison (controlled setting). Peak memory, latency, and throughput are measured on a single H100-80GB GPU generating 256 tokens in bfloat16.

| Model | Params (B) | Peak Mem (GB) | Latency (s) | Tokens/s |
| --- | --- | --- | --- | --- |
| Student (BD3LM-0.6B), distilled | 0.60 | 1.4 | 6.25 | 41.0 |
| Student (BD3LM-0.6B), no distill | 0.60 | 1.4 | 6.08 | 42.1 |
| AR baseline (Qwen3-0.6B-Base) | 0.60 | 1.2 | 4.99 | 51.3 |
| Teacher: WeDLM-8B-Instruct | 8.19 | 15.5 | 6.79 | 37.7 |
| Teacher: LLaDA2.0-mini | 16.26 | 31.3 | 32.55 | 7.8 |
  • Distillation Enables Practical Deployment: The distilled student requires only 1.4 GB peak memory (22x reduction vs. LLaDA2's 31.3 GB) and is 5.2x faster (6.25s vs. 32.55s).
  • Distillation Adds Minimal Overhead: Compared to the undistilled BD3LM, distillation introduces only a 2.6% throughput reduction (41.0 vs. 42.1 tokens/s) and an identical memory footprint. The iterative diffusion process makes the student somewhat slower than the same-sized AR baseline (41.0 vs. 51.3 tokens/s), but this gap stems from diffusion decoding itself rather than from distillation.