Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models - Summary
Summary (Overview)
- Pioneering Framework: Introduces TIDE, the first unified framework for cross-architecture knowledge distillation for Diffusion Large Language Models (dLLMs), addressing heterogeneous transfer between models with different architectures, attention mechanisms, and tokenizers.
- Three Modular Components: Proposes three novel components to overcome specific challenges:
  - TIDAL: A dual-axis scheduler modulating distillation strength based on both training progress and diffusion timestep to account for the teacher's noise-dependent reliability.
  - COMPDEMO: Enriches teacher context via complementary mask splitting, providing better predictions under heavy masking.
  - Reverse CALM: A cross-tokenizer objective using inverted chunk-level likelihood matching to yield bounded gradients and dual-end noise filtering.
- Effective Distillation: Distilling 16B MoE and 8B dense teachers into a 0.6B student outperforms the non-distilled baseline by an average of +1.53 points across eight benchmarks and achieves a +16.48 gain on HumanEval code generation over a same-sized autoregressive model.
- Practical Efficiency: The distilled 0.6B student requires 22x less memory and runs 5x faster than the 16B teacher, enabling deployment on commodity hardware with minimal inference overhead.
Introduction and Theoretical Foundation
Diffusion Large Language Models (dLLMs) offer parallel decoding and bidirectional context as an alternative to dominant autoregressive (AR) models. However, state-of-the-art dLLMs require billions of parameters (e.g., 8B-100B), posing a significant deployment barrier. While knowledge distillation is well-established for AR models and existing dLLM distillation methods focus on step compression (reducing inference steps within the same architecture), cross-architecture distillation for dLLMs remains unexplored.
This setting introduces three fundamental challenges:
- Temporal Inconsistency: The teacher's reliability fluctuates drastically across the diffusion process (timestep-dependent).
- Context Scarcity: Severe masking at high noise levels reduces available context, making teacher predictions uninformative.
- Vocabulary Misalignment: Distinct tokenizer vocabularies render standard token-level likelihood objectives inapplicable.
TIDE is proposed as a unified framework to overcome these temporal, spatial, and vocabulary barriers through three synergistic components that orchestrate an end-to-end learning pipeline.
Methodology
The TIDE framework distills a large teacher dLLM $f_T$ (parameters $\theta_T$) into a smaller student $f_S$ (parameters $\theta_S$), where the two may differ in architecture, attention mechanism, and tokenizer. Let $x_0$ be a clean token sequence and $x_t$ its noised version at diffusion timestep $t \in [0, 1]$, with masked positions replaced by [MASK].
1. Time-Iteration Dual-Axis Lambda Modulation (TIDAL)
This component dynamically modulates distillation strength along two axes to handle temporal inconsistency.
- Axis 1: Diffusion Timestep: Modulates $\lambda_t$ based on teacher reliability at noise level $t$. At high noise ($t \to 1$), $\lambda_t \to 0$, avoiding unreliable teacher signals; at low noise ($t \to 0$), $\lambda_t \to \lambda_{\text{base}}$, fully relying on the teacher.
- Axis 2: Training Progress: The base coefficient $\lambda_{\text{base}}$ follows a cosine schedule over normalized progress $p \in [0, 1]$, rising from $\lambda_{\min}$ to $\lambda_{\max}$. Early training is student-dominated to prevent collapse; later stages shift to teacher supervision.
- Interpolated Target and Loss: Given student logits $z_S$ and teacher logits $z_T$ at masked positions, the interpolated target is $\tilde{z} = (1 - \lambda)\, z_S + \lambda\, z_T$. The TIDAL loss is then $\mathcal{L}_{\mathrm{TIDAL}} = D_{\mathrm{KL}}\big(\mathrm{softmax}(\tilde{z}) \,\|\, \mathrm{softmax}(z_S)\big)$, where $\tilde{z}$ is detached from the graph. An optional midrange timestep weighting can be applied.
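The dual-axis schedule can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the cosine ramp, the $(1 - t)$ timestep factor, and the `lam_min`/`lam_max` defaults are assumptions filling in behavior the summary describes only qualitatively.

```python
import math

def tidal_lambda(t: float, p: float, lam_min: float = 0.0, lam_max: float = 1.0) -> float:
    """Dual-axis distillation coefficient (illustrative).

    t: diffusion timestep in [0, 1], with t -> 1 meaning heavier masking.
    p: normalized training progress in [0, 1].
    lam_min / lam_max: schedule bounds (hypothetical defaults).
    """
    # Axis 2 (training progress): cosine ramp from lam_min toward lam_max,
    # so early training is student-dominated and later stages lean on the teacher.
    lam_base = lam_min + 0.5 * (lam_max - lam_min) * (1.0 - math.cos(math.pi * p))
    # Axis 1 (diffusion timestep): attenuate the teacher signal at high noise,
    # where its predictions are least reliable.
    return lam_base * (1.0 - t)

def interpolated_target(z_student, z_teacher, lam):
    """Per-position interpolated target logits; in a real training graph the
    resulting target would be detached (stop-gradient)."""
    return [(1.0 - lam) * zs + lam * zt for zs, zt in zip(z_student, z_teacher)]
```

Note how the two axes compose multiplicatively here: even late in training (large `lam_base`), a heavily noised input still suppresses the teacher signal.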
2. Complementary Demonstration (COMPDEMO)
This component enriches teacher context to overcome spatial scarcity under heavy masking.
- Mask Splitting: The masked set $\mathcal{M}$ is randomly partitioned into two complementary subsets $\mathcal{M}_1$ and $\mathcal{M}_2$ such that $\mathcal{M}_1 \cup \mathcal{M}_2 = \mathcal{M}$, $\mathcal{M}_1 \cap \mathcal{M}_2 = \emptyset$, and $|\mathcal{M}_1| = \rho\,|\mathcal{M}|$, where $\rho$ is the demonstration ratio.
- Two-Pass Teacher Inference: Perform two forward passes through the frozen teacher:
- Pass 1: logits $z^{(1)}$ with positions in $\mathcal{M}_1$ revealed (filled with ground-truth tokens) and $\mathcal{M}_2$ kept masked.
- Pass 2: logits $z^{(2)}$ with $\mathcal{M}_2$ revealed and $\mathcal{M}_1$ kept masked. The merged final logits are $z_T[i] = z^{(1)}[i]$ for $i \in \mathcal{M}_2$ and $z_T[i] = z^{(2)}[i]$ for $i \in \mathcal{M}_1$.
- This strategy doubles teacher forward passes but increases total training time by only ~50% as the teacher is frozen.
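A minimal sketch of the split-and-merge logic, treating the frozen teacher as an opaque callable. The helper names and the exact reveal/merge convention are reconstructed from the two-pass description above, not taken from the paper's code.

```python
import random

MASK = -1  # placeholder id for the [MASK] token in this sketch

def split_masks(masked_positions, rho=0.5, seed=0):
    """Randomly partition the masked set M into complementary subsets
    M1 and M2 with |M1| = round(rho * |M|) (demonstration ratio rho)."""
    rng = random.Random(seed)
    positions = sorted(masked_positions)
    rng.shuffle(positions)
    k = round(rho * len(positions))
    return set(positions[:k]), set(positions[k:])

def compdemo_logits(teacher, x_noised, x_clean, masked, rho=0.5):
    """Two-pass teacher inference with complementary demonstrations.

    Pass 1 reveals ground-truth tokens at M1 so the teacher predicts M2
    with richer context; pass 2 is the mirror image. The merged output
    takes each masked position from the pass in which it stayed masked.
    """
    m1, m2 = split_masks(masked, rho)
    x1 = [x_clean[i] if i in m1 else tok for i, tok in enumerate(x_noised)]
    x2 = [x_clean[i] if i in m2 else tok for i, tok in enumerate(x_noised)]
    z1, z2 = teacher(x1), teacher(x2)  # two forward passes; teacher is frozen
    return {i: (z1[i] if i in m2 else z2[i]) for i in masked}
```

Because each masked position is predicted in exactly one pass, the merged logits cover all of $\mathcal{M}$ while every prediction benefits from roughly half the masks being revealed as context.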
3. Distillation Objectives & Reverse CALM
T IDE supports two pipelines with tailored objectives.
- Shared-Tokenizer Objective (WeDLM → BD3LM): Teacher and student share a tokenizer. The loss combines Cross-Entropy (CE) and TIDAL: $\mathcal{L}_{\text{shared}} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{TIDAL}}$. COMPDEMO can be optionally integrated.
- Cross-Tokenizer Objective (LLaDA2 → BD3LM): Teacher and student have different vocabularies ($\mathcal{V}_T \neq \mathcal{V}_S$). Chunk-level Approximate Likelihood Matching (CALM) is introduced, aligning token sequences at the byte level to identify chunks.
- Chunk-Level Log-Probabilities: For each token position $i$, the model assigns a log-probability $\ell_i$ to the target token. Chunk-level log-probabilities are obtained via alignment matrices $A_T$ and $A_S$, which sum the log-probs of each chunk's member tokens: $\ell^c_T = A_T\,\ell_T$ and $\ell^c_S = A_S\,\ell_S$. Chunk probabilities: $q_T = \exp(\ell^c_T)$ and $q_S = \exp(\ell^c_S)$.
- Forward CALM (Baseline): Applies Binary Cross-Entropy (BCE) with the teacher chunk probabilities as targets: $\mathcal{L}_{\mathrm{F\text{-}CALM}} = -\sum_c \big[ q_T^{(c)} \log q_S^{(c)} + (1 - q_T^{(c)}) \log(1 - q_S^{(c)}) \big]$. It can be integrated with TIDAL.
- Reverse CALM (Proposed): To address gradient explosion in Forward CALM (the $\log q_S$ term diverges as $q_S \to 0$), the BCE direction is reversed: $\mathcal{L}_{\mathrm{R\text{-}CALM}} = -\sum_c \big[ q_S^{(c)} \log q_T^{(c)} + (1 - q_S^{(c)}) \log(1 - q_T^{(c)}) \big]$. This yields bounded gradients (the gradient coefficient depends only on the fixed teacher) and provides dual-end noise filtering. It is equivalent to minimizing the Bernoulli KL divergence $D_{\mathrm{KL}}\big(\mathrm{Bern}(q_S) \,\|\, \mathrm{Bern}(q_T)\big)$.
- The cross-tokenizer training objective combines cross-entropy with Reverse CALM: $\mathcal{L}_{\text{cross}} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{R\text{-}CALM}}$.
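The contrast between the two BCE directions can be made concrete on scalar chunk probabilities. This sketch assumes a chunk's log-probability is the sum of its member tokens' log-probs (the role of the alignment matrices); the function names are hypothetical, not the paper's API.

```python
import math

def chunk_logprob(token_logprobs, chunk_members):
    """Chunk log-probability: sum of the byte-aligned member tokens'
    log-probs (what the alignment matrices A_T and A_S compute)."""
    return sum(token_logprobs[i] for i in chunk_members)

def forward_calm(q_t, q_s, eps=1e-6):
    """Forward chunk-level BCE: teacher chunk prob q_t is the target.
    Its gradient w.r.t. q_s carries 1/q_s and 1/(1 - q_s) factors, so it
    diverges as the student probability collapses toward 0 or 1."""
    q_s = min(max(q_s, eps), 1.0 - eps)
    return -(q_t * math.log(q_s) + (1.0 - q_t) * math.log(1.0 - q_s))

def reverse_calm(q_t, q_s, eps=1e-6):
    """Reverse chunk-level BCE: roles swapped. The loss is linear in q_s
    with slope -(log q_t - log(1 - q_t)), a bounded coefficient that
    depends only on the frozen teacher."""
    q_t = min(max(q_t, eps), 1.0 - eps)
    return -(q_s * math.log(q_t) + (1.0 - q_s) * math.log(1.0 - q_t))
```

Two properties are easy to check numerically: the reverse loss's slope in `q_s` is set entirely by the teacher, and a maximally uncertain teacher chunk ($q_T = 0.5$) contributes zero gradient, whereas the forward loss blows up when the student assigns a chunk near-zero probability.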
Empirical Validation / Results
Experimental Setup
- Student Model: Qwen3-0.6B-BD3LM (0.6B block diffusion model).
- Teacher Models:
- Pipeline A (Cross-Tokenizer): LLaDA2.0-mini (16B MoE, independent tokenizer).
- Pipeline B (Shared-Tokenizer): WeDLM-8B-Instruct (8B dense, shared Qwen tokenizer).
- Training: 10 epochs, LR=5e-5, sequence length 512, on combined SFT datasets (Tulu-3, SmolTalk, OpenCoder).
- Evaluation: 8 benchmarks: GSM8K, MATH, BBH, MMLU-Pro, HellaSwag, MMLU, HumanEval, MBPP.
- Baselines: AR model (Qwen3-0.6B-Base) and non-distilled BD3LM from Zhou et al. (2026).
Main Results
Table 1: Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. Bold: best among dLLM models; underline: second best.
| Benchmark | AR (Qwen3-0.6B) | BD3LM (No Distill) | Shared: KL | Shared: TIDE-Cross | Shared: TIDE-Shared | Cross: CALM | Cross: TIDE-Shared |
|---|---|---|---|---|---|---|---|
| GSM8K | 59.60 | 45.56 | 43.97 | 45.03 | 48.98 | 48.60 | 49.89 |
| MATH | 32.40 | 13.08 | 9.40 | 9.76 | 11.16 | 13.14 | 12.98 |
| BBH | 41.50 | 26.32 | 25.79 | 26.00 | 26.79 | 24.21 | 26.85 |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 12.88 | 14.48 | 13.47 | 14.02 |
| HellaSwag | 47.40 | 39.28 | 39.78 | 39.50 | 40.50 | 40.42 | 39.57 |
| MMLU | 52.80 | 39.15 | 39.57 | 39.09 | 39.92 | 39.42 | 39.54 |
| HumanEval | 32.30 | 46.34 | 41.46 | 42.68 | 48.78 | 43.90 | 49.39 |
| MBPP | 36.60 | 37.80 | 31.20 | 31.40 | 37.80 | 34.80 | 38.40 |
| Avg | 40.91 | 32.67 | 30.55 | 30.79 | 33.55 | 32.25 | 33.83 |
- Cross-Architecture Distillation Is Effective: Both TIDE pipelines outperform the non-distilled BD3LM baseline (Avg 32.67). The cross-tokenizer TIDE-Cross strategy achieves the highest average score (34.20), and shared-tokenizer TIDE-Shared reaches 33.55.
- Each Pipeline Favors Its Native Strategy:
  - The cross-tokenizer pipeline prefers TIDE-Cross (Reverse CALM) by +0.37 avg, which is better suited to alignment noise.
  - The shared-tokenizer pipeline prefers TIDE-Shared (TIDAL + COMPDEMO) by +2.76 avg, which is most effective under exact token alignment.
- Distilled dLLMs Excel at Code Generation: On HumanEval, TIDE-Shared (shared pipeline) scores 48.78 and TIDE-Cross (cross pipeline) scores 48.17, substantially exceeding the AR baseline (32.30). This suggests parallel diffusion decoding benefits structured output generation.
Ablation Studies
Ablations on the shared-tokenizer pipeline (TIDE-Shared strategy) isolate component contributions.
Table 2: Component-level ablation on the shared-tokenizer pipeline (WeDLM → Qwen3-BD3LM). Bold: best per row.
| Benchmark | Baseline (w/o Train Axis) | w/o Tstep Axis | w/o COMPDEMO | Full (TIDAL + COMPDEMO) |
|---|---|---|---|---|
| GSM8K | 48.07 | 48.82 | 48.90 | 48.90 |
| MATH | 11.74 | 11.96 | 11.84 | 11.68 |
| BBH | 26.37 | 26.51 | 26.77 | 26.66 |
| MMLU-Pro | 14.12 | 14.42 | 13.76 | 13.76 |
| HellaSwag | 40.03 | 40.35 | 40.16 | 40.27 |
| MMLU | 39.81 | 39.84 | 39.58 | 39.92 |
| HumanEval | 45.73 | 43.90 | 44.51 | 46.95 |
| MBPP | 38.60 | 37.20 | 38.20 | 37.00 |
| Avg | 33.06 | 32.88 | 32.97 | 33.14 |
- Timestep Axis Is Most Impactful: Removing it causes the largest avg drop (-0.26), with a -3.05 drop on HumanEval, validating its necessity.
- COMPDEMO Provides Consistent Gains: Removing it reduces avg by -0.17, with notable drops on HumanEval (-2.44) and MMLU (-0.34).
- Full Framework Outperforms Baseline: The complete TIDE (dual-axis TIDAL + COMPDEMO) outperforms the timestep-only baseline, with the training-progress axis stabilizing early training.
Inference Efficiency
Table 3: Inference efficiency comparison (controlled setting). Peak memory, latency, and throughput are measured on a single H100-80GB GPU generating 256 tokens in bfloat16.
| Model | Params (B) | Peak Mem (GB) | Latency (s) | Tokens/s |
|---|---|---|---|---|
| Student (BD3LM-0.6B) | | | | |
| Distilled | 0.60 | 1.4 | 6.25 | 41.0 |
| No Distill | 0.60 | 1.4 | 6.08 | 42.1 |
| AR Baseline | | | | |
| Qwen3-0.6B-Base | 0.60 | 1.2 | 4.99 | 51.3 |
| Teachers | | | | |
| WeDLM-8B-Instruct | 8.19 | 15.5 | 6.79 | 37.7 |
| LLaDA2.0-mini | 16.26 | 31.3 | 32.55 | 7.8 |
- Distillation Enables Practical Deployment: The distilled student requires only 1.4 GB peak memory (22x reduction vs. LLaDA2's 31.3 GB) and is 5.2x faster (6.25s vs. 32.55s).
- Distillation Adds Minimal Overhead: Compared to the undistilled BD3LM, distillation introduces only a 2.6% throughput reduction (41.0 vs. 42.1 tokens/s) and an identical memory footprint. The iterative diffusion process makes the student somewhat slower than the AR baseline (41.0 vs. 51.3 tokens/s), a cost of the decoding paradigm rather than of distillation.
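As a quick sanity check, the quoted reductions follow directly from the Table 3 figures (comparing the distilled student against the LLaDA2.0-mini teacher):

```python
# Figures quoted from Table 3 (single H100-80GB, 256 generated tokens, bfloat16).
student_mem_gb, teacher_mem_gb = 1.4, 31.3    # peak memory: distilled student vs LLaDA2.0-mini
student_lat_s, teacher_lat_s = 6.25, 32.55    # end-to-end latency for 256 tokens

mem_reduction = teacher_mem_gb / student_mem_gb   # ~22x memory reduction
speedup = teacher_lat_s / student_lat_s           # ~5.2x latency speedup

print(f"memory: {mem_reduction:.1f}x, speed: {speedup:.1f}x")
```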