Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models - Summary

Summary (Overview)

  • Pioneering Framework: Introduces TIDE, the first unified framework for cross-architecture knowledge distillation for Diffusion Large Language Models (dLLMs), addressing heterogeneous transfer between models with different architectures, attention mechanisms, and tokenizers.
  • Three Modular Components: Proposes three novel components to overcome specific challenges:
    1. TIDAL: A dual-axis scheduler modulating distillation strength based on both training progress and diffusion timestep to account for the teacher's noise-dependent reliability.
    2. COMPDEMO: Enriches teacher context via complementary mask splitting, providing better predictions under heavy masking.
    3. Reverse CALM: A cross-tokenizer objective using inverted chunk-level likelihood matching to yield bounded gradients and dual-end noise filtering.
  • Effective Distillation: A 0.6B student distilled from 16B MoE and 8B dense teachers outperforms the non-distilled baseline by an average of +1.53 points across eight benchmarks and gains +16.48 on HumanEval code generation over a same-sized autoregressive model.
  • Practical Efficiency: The distilled 0.6B student requires 22x less memory and runs 5x faster than the 16B teacher, enabling deployment on commodity hardware with minimal inference overhead.

Introduction and Theoretical Foundation

Diffusion Large Language Models (dLLMs) offer parallel decoding and bidirectional context as an alternative to dominant autoregressive (AR) models. However, state-of-the-art dLLMs require billions of parameters (e.g., 8B-100B), posing a significant deployment barrier. While knowledge distillation is well-established for AR models and existing dLLM distillation methods focus on step compression (reducing inference steps within the same architecture), cross-architecture distillation for dLLMs remains unexplored.

This setting introduces three fundamental challenges:

  1. Temporal Inconsistency: The teacher's reliability fluctuates drastically across the diffusion process (timestep-dependent).
  2. Context Scarcity: Severe masking at high noise levels reduces available context, making teacher predictions uninformative.
  3. Vocabulary Misalignment: Distinct tokenizer vocabularies render standard token-level likelihood objectives inapplicable.

TIDE is proposed as a unified framework to overcome these temporal, spatial, and vocabulary barriers through three synergistic components that orchestrate an end-to-end learning pipeline.

Methodology

The TIDE framework distills a large teacher dLLM $f_T$ (parameters $\theta_T$) into a smaller student $f_S$ (parameters $\theta_S$), where the two may differ in architecture, attention, and tokenizer. Let $x = (x_1, \ldots, x_L)$ be a clean token sequence and $x_t$ its noised version at diffusion timestep $t \in [\epsilon, 1)$, with masked positions $M$ replaced by [MASK].
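
For concreteness, here is a minimal PyTorch sketch of this masking corruption, assuming each position is masked independently with probability $t$; the `MASK_ID` value and tensor shapes are illustrative placeholders, not values from the paper:

```python
import torch

MASK_ID = 0  # placeholder [MASK] token id; the actual id depends on the tokenizer

def corrupt(x: torch.Tensor, t: float):
    """Mask each token of a clean sequence x (batch, L) independently with probability t."""
    m = torch.rand(x.shape, device=x.device) < t  # M: the set of masked positions
    x_t = x.masked_fill(m, MASK_ID)               # noised sequence at timestep t
    return x_t, m
```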

1. Time-Iteration Dual-Axis Lambda Modulation (TIDAL)

This component dynamically modulates distillation strength along two axes to handle temporal inconsistency.

  • Axis 1: Diffusion Timestep: Modulates based on teacher reliability at noise level $t$: $\lambda_t = \lambda_{\text{train}} \times (1 - t)$. At high noise ($t \approx 1$), $\lambda_t \approx 0$, avoiding unreliable teacher signals. At low noise ($t \approx 0$), $\lambda_t \approx \lambda_{\text{train}}$, fully relying on the teacher.
  • Axis 2: Training Progress: The base coefficient $\lambda_{\text{train}}$ follows a cosine schedule over normalized progress $p \in [0, 1]$: $\lambda_{\text{train}} = \lambda_{\text{init}} + (\lambda_{\max} - \lambda_{\text{init}}) \times \frac{1}{2}(1 - \cos(\pi p))$, with defaults $\lambda_{\text{init}} = 0.1$ and $\lambda_{\max} = 0.9$. Early training is student-dominated to prevent collapse; later stages shift to teacher supervision.
  • Interpolated Target and Loss: Given student logits $s$ and teacher logits $t$ at masked positions, the interpolated target is $r_t = \text{softmax}\!\left(\frac{(1 - \lambda_t)\, s + \lambda_t\, t}{T}\right)$, and the TIDAL loss is $\mathcal{L}_{\text{TIDAL}} = D_{\mathrm{KL}}\!\left(r_t \,\|\, \text{softmax}(s / T)\right) \times T^2$, where $r_t$ is detached from the computation graph. An optional midrange timestep weighting $w(t) = \exp\!\left(-\frac{(t - 0.5)^2}{2\sigma^2}\right)$ with $\sigma = 0.15$ can be applied (see the sketch after this list).
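
A minimal PyTorch sketch of the dual-axis schedule and the TIDAL loss as described above; the temperature value `T = 2.0` and the tensor shapes are assumptions for illustration only:

```python
import math
import torch
import torch.nn.functional as F

def lambda_schedule(p: float, t: float, lam_init: float = 0.1, lam_max: float = 0.9) -> float:
    """Dual-axis coefficient: cosine ramp over training progress p, scaled by (1 - t)."""
    lam_train = lam_init + (lam_max - lam_init) * 0.5 * (1.0 - math.cos(math.pi * p))
    return lam_train * (1.0 - t)  # trust the teacher less at high noise levels

def tidal_loss(s: torch.Tensor, t_logits: torch.Tensor, p: float, t: float,
               T: float = 2.0, sigma: float = 0.15, midrange: bool = False) -> torch.Tensor:
    """TIDAL loss on student logits s and teacher logits t_logits at masked positions,
    both of shape (num_masked, vocab)."""
    lam_t = lambda_schedule(p, t)
    # Interpolated target r_t, detached so gradients flow only through the student term
    r_t = F.softmax(((1.0 - lam_t) * s + lam_t * t_logits) / T, dim=-1).detach()
    loss = F.kl_div(F.log_softmax(s / T, dim=-1), r_t, reduction="batchmean") * T ** 2
    if midrange:  # optional Gaussian emphasis on mid-range timesteps
        loss = loss * math.exp(-(t - 0.5) ** 2 / (2 * sigma ** 2))
    return loss
```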

2. Complementary Demonstration (COMPDEMO)

This component enriches teacher context to overcome spatial scarcity under heavy masking.

  • Mask Splitting: The masked set $M$ is randomly partitioned into two complementary subsets $M_A$ and $M_B$ such that $M_A \cup M_B = M$, $M_A \cap M_B = \emptyset$, and $|M_A| / |M| \approx \rho$, where the demonstration ratio $\rho = 0.5$.
  • Two-Pass Teacher Inference: Perform two forward passes through the frozen teacher (a minimal sketch follows this list):
    • Pass 1: $t^{(1)} = f_T(\text{reveal } M_A,\ \text{mask } M_B)$ → logits at $M_B$.
    • Pass 2: $t^{(2)} = f_T(\text{reveal } M_B,\ \text{mask } M_A)$ → logits at $M_A$. The merged final logits are $t_{\text{final}}[M_B] \leftarrow t^{(1)}[M_B]$ and $t_{\text{final}}[M_A] \leftarrow t^{(2)}[M_A]$.
  • This strategy doubles teacher forward passes but increases total training time by only ~50% as the teacher is frozen.
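
A hedged sketch of the two-pass teacher inference, treating the frozen teacher as a callable from token ids to logits of shape (batch, L, vocab); the exact model interface is an assumption:

```python
import torch

@torch.no_grad()
def compdemo_teacher_logits(teacher, x: torch.Tensor, x_t: torch.Tensor,
                            mask: torch.Tensor, rho: float = 0.5) -> torch.Tensor:
    """Split masked positions M into complementary halves M_A / M_B, run two teacher
    passes, and merge so every masked position is predicted with the other half revealed."""
    coin = torch.rand(mask.shape, device=mask.device) < rho
    mask_a = mask & coin        # M_A, roughly rho * |M| positions
    mask_b = mask & ~coin       # M_B = M \ M_A

    logits_1 = teacher(torch.where(mask_a, x, x_t))   # reveal M_A, keep M_B masked
    logits_2 = teacher(torch.where(mask_b, x, x_t))   # reveal M_B, keep M_A masked

    merged = logits_1.clone()                         # t_final[M_B] <- pass-1 logits
    merged[mask_a] = logits_2[mask_a]                 # t_final[M_A] <- pass-2 logits
    return merged
```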

3. Distillation Objectives & Reverse CALM

TIDE supports two pipelines with tailored objectives.

  • Shared-Tokenizer Objective (WeDLM → BD3LM): Teacher and student share a tokenizer. The loss combines Cross-Entropy (CE) and TIDAL:

    $\mathcal{L}_B = \mathcal{L}_{\text{CE}} + w_{\text{tidal}} \cdot \mathcal{L}_{\text{TIDAL}}$

    COMPDEMO can optionally be integrated.

  • Cross-Tokenizer Objective (LLaDA2 → BD3LM): Teacher and student have different vocabularies ($V_T \neq V_S$). Chunk-level Approximate Likelihood Matching (CALM) is introduced, aligning token sequences at the byte level to identify chunks.

    • Chunk-Level Log-Probabilities: For each token $x_i$, $\log P(x_i) = \text{logits}_{x_i} - \operatorname{logsumexp}(\text{logits})$. Chunk-level log-probabilities are obtained via alignment matrices $A_S$ and $A_T$: $\text{LP}_S = \text{lp}_S \cdot A_S \in \mathbb{R}^{b \times C}$ and $\text{LP}_T = \text{lp}_T \cdot A_T \in \mathbb{R}^{b \times C}$. Chunk probabilities: $p^c_s = \exp(\text{LP}^c_S / T)$ and $p^c_t = \exp(\text{LP}^c_T / T)$.
    • Forward CALM (Baseline): Applies Binary Cross-Entropy (BCE): $\mathcal{L}_{\text{Fwd-CALM}} = -\left[\, p^c_t \log p^c_s + (1 - p^c_t) \log(1 - p^c_s) \,\right]$. It can be integrated with TIDAL via $p_{\text{mix}} = (1 - \lambda_t) \cdot p^c_s + \lambda_t \cdot p^c_t$.
    • Reverse CALM (Proposed): To address gradient explosion in Forward CALM (driven by divergence of the ratio $p^c_t / p^c_s$), the BCE direction is reversed: $\mathcal{L}_{\text{Rev-CALM}} = -\left[\, p^c_s \log p^c_t + (1 - p^c_s) \log(1 - p^c_t) \,\right]$. This yields bounded gradients (the gradient coefficient $\log \frac{p^c_t}{1 - p^c_t}$ depends only on the fixed teacher) and provides dual-end noise filtering. It is equivalent to minimizing the Bernoulli KL divergence $\mathrm{KL}_{\text{Bern}}(p^c_s \,\|\, p^c_t)$.
    • The cross-tokenizer training objective is $\mathcal{L}_A = \mathcal{L}_{\text{CE}} + w_{\text{calm}} \cdot \mathcal{L}_{\text{dist}}$, where $\mathcal{L}_{\text{dist}} \in \{\mathcal{L}_{\text{CALM-TIDAL}}, \mathcal{L}_{\text{Rev-CALM}}\}$ (a sketch of Reverse CALM follows this list).
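
A minimal PyTorch sketch of the chunk-level aggregation and the Reverse CALM loss under the definitions above. The alignment matrices are assumed to be precomputed 0/1 token-to-chunk maps derived from the byte-level alignment, and the temperature and clamping epsilon are illustrative assumptions:

```python
import torch

def chunk_log_probs(lp: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """Aggregate token-level log-probs lp (b, L) into chunk-level log-probs (b, C)
    via a 0/1 alignment matrix A (L, C) mapping tokens to byte-aligned chunks."""
    return lp @ A

def reverse_calm_loss(lp_s: torch.Tensor, lp_t: torch.Tensor,
                      A_s: torch.Tensor, A_t: torch.Tensor,
                      T: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """Reverse CALM: BCE with the roles of student and teacher chunk probabilities
    swapped, so the per-chunk gradient coefficient depends only on the frozen teacher."""
    p_s = torch.exp(chunk_log_probs(lp_s, A_s) / T).clamp(eps, 1.0 - eps)
    p_t = torch.exp(chunk_log_probs(lp_t, A_t) / T).clamp(eps, 1.0 - eps).detach()
    # L_Rev-CALM = -[ p_s * log p_t + (1 - p_s) * log(1 - p_t) ], averaged over chunks
    loss = -(p_s * torch.log(p_t) + (1.0 - p_s) * torch.log(1.0 - p_t))
    return loss.mean()

# Full cross-tokenizer objective (sketch): L_A = L_CE + w_calm * L_dist,
# with L_dist being reverse_calm_loss above or a CALM-TIDAL variant.
```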

Empirical Validation / Results

Experimental Setup

  • Student Model: Qwen3-0.6B-BD3LM (0.6B block diffusion model).
  • Teacher Models:
    • Pipeline A (Cross-Tokenizer): LLaDA2.0-mini (16B MoE, independent tokenizer).
    • Pipeline B (Shared-Tokenizer): WeDLM-8B-Instruct (8B dense, shared Qwen tokenizer).
  • Training: 10 epochs, LR=5e-5, sequence length 512, on combined SFT datasets (Tulu-3, SmolTalk, OpenCoder).
  • Evaluation: 8 benchmarks: GSM8K, MATH, BBH, MMLU-Pro, HellaSwag, MMLU, HumanEval, MBPP.
  • Baselines: AR model (Qwen3-0.6B-Base) and non-distilled BD3LM from Zhou et al. (2026).

Main Results

Table 1: Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. Columns prefixed "Shared:" use the shared-tokenizer pipeline (WeDLM teacher); columns prefixed "Cross:" use the cross-tokenizer pipeline (LLaDA2 teacher).

| Benchmark | AR (Qwen3-0.6B) | BD3LM (No Distill) | Shared: KL | Shared: TIDE-Cross | Shared: TIDE-Shared | Cross: CALM | Cross: TIDE-Shared |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GSM8K | 59.60 | 45.56 | 43.97 | 45.03 | 48.98 | 48.60 | 49.89 |
| MATH | 32.40 | 13.08 | 9.40 | 9.76 | 11.16 | 13.14 | 12.98 |
| BBH | 41.50 | 26.32 | 25.79 | 26.00 | 26.79 | 24.21 | 26.85 |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 12.88 | 14.48 | 13.47 | 14.02 |
| HellaSwag | 47.40 | 39.28 | 39.78 | 39.50 | 40.50 | 40.42 | 39.57 |
| MMLU | 52.80 | 39.15 | 39.57 | 39.09 | 39.92 | 39.42 | 39.54 |
| HumanEval | 32.30 | 46.34 | 41.46 | 42.68 | 48.78 | 43.90 | 49.39 |
| MBPP | 36.60 | 37.80 | 31.20 | 31.40 | 37.80 | 34.80 | 38.40 |
| Avg | 40.91 | 32.67 | 30.55 | 30.79 | 33.55 | 32.25 | 33.83 |
  • Cross-Architecture Distillation Is Effective: Both TIDE pipelines outperform the non-distilled BD3LM baseline (Avg 32.67). The cross-tokenizer TIDE-Cross strategy achieves the highest average score (34.20), and shared-tokenizer TIDE-Shared reaches 33.55.
  • Each Pipeline Favors Its Native Strategy:
    • Cross-tokenizer pipeline prefers TIDE-Cross (Reverse CALM) by +0.37 avg, suited for alignment noise.
    • Shared-tokenizer pipeline prefers TIDE-Shared (TIDAL + COMPDEMO) by +2.76 avg, effective with exact token alignment.
  • Distilled dLLMs Excel at Code Generation: On HumanEval, TIDE-Shared (shared) scores 48.78 and TIDE-Cross (cross) scores 48.17, substantially exceeding the AR baseline (32.30). This suggests parallel diffusion decoding benefits structured output generation.

Ablation Studies

Ablations on the shared-tokenizer pipeline (TIDE-Shared strategy) isolate component contributions.

Table 2: Component-level ablation on the shared-tokenizer pipeline (WeDLM → Qwen3-BD3LM).

| Benchmark | Baseline (w/o train-progress axis) | w/o Timestep Axis | w/o COMPDEMO | Full (TIDAL + COMPDEMO) |
| --- | --- | --- | --- | --- |
| GSM8K | 48.07 | 48.82 | 48.90 | 48.90 |
| MATH | 11.74 | 11.96 | 11.84 | 11.68 |
| BBH | 26.37 | 26.51 | 26.77 | 26.66 |
| MMLU-Pro | 14.12 | 14.42 | 13.76 | 13.76 |
| HellaSwag | 40.03 | 40.35 | 40.16 | 40.27 |
| MMLU | 39.81 | 39.84 | 39.58 | 39.92 |
| HumanEval | 45.73 | 43.90 | 44.51 | 46.95 |
| MBPP | 38.60 | 37.20 | 38.20 | 37.00 |
| Avg | 33.06 | 32.88 | 32.97 | 33.14 |
  • Timestep Axis Is Most Impactful: Removing it causes the largest avg drop (-0.26), with a -3.05 drop on HumanEval, validating its necessity.
  • COMPDEMO Provides Consistent Gains: Removing it reduces avg by -0.17, with notable drops on HumanEval (-2.44) and MMLU (-0.34).
  • Full Framework Outperforms Baseline: The complete TIDE (dual-axis + COMPDEMO) outperforms the timestep-only baseline, stabilizing early training.

Inference Efficiency

Table 3: Inference efficiency comparison (controlled setting). Peak memory, latency, and throughput are measured on a single H100-80GB GPU generating 256 tokens in bfloat16.

| Model | Params (B) | Peak Mem (GB) | Latency (s) | Tokens/s |
| --- | --- | --- | --- | --- |
| Student (BD3LM-0.6B), distilled | 0.60 | 1.4 | 6.25 | 41.0 |
| Student (BD3LM-0.6B), no distill | 0.60 | 1.4 | 6.08 | 42.1 |
| AR baseline (Qwen3-0.6B-Base) | 0.60 | 1.2 | 4.99 | 51.3 |
| Teacher: WeDLM-8B-Instruct | 8.19 | 15.5 | 6.79 | 37.7 |
| Teacher: LLaDA2.0-mini | 16.26 | 31.3 | 32.55 | 7.8 |
  • Distillation Enables Practical Deployment: The distilled student requires only 1.4 GB peak memory (22x reduction vs. LLaDA2's 31.3 GB) and is 5.2x faster (6.25s vs. 32.55s).
  • Distillation Adds Minimal Overhead: Compared to the undistilled BD3LM, distillation introduces only a 2.6% throughput reduction (41.0 vs. 42.1 tokens/s) and an identical memory footprint. The iterative diffusion process makes the student somewhat slower than the same-sized AR baseline (41.0 vs. 51.3 tokens/s), but this gap stems from diffusion decoding itself rather than from distillation.