# Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

> TIDE enables cross-architecture distillation for diffusion LLMs by introducing a timestep-dependent scheduler, complementary mask splitting, and a cross-tokenizer objective, yielding a 0.6B student that outperforms its baseline and runs 5x faster than its 16B teacher.

- **Source:** [arXiv](https://arxiv.org/abs/2604.26951)
- **Published:** 2026-05-01
- **Permalink:** https://picx.dev/p/Zjln2g
- **Whiteboard:** https://picx.dev/p/Zjln2g/image

## Summary

## Summary (Overview)
*   **Pioneering Framework**: Introduces **TIDE**, the first unified framework for **cross-architecture knowledge distillation** for Diffusion Large Language Models (dLLMs), addressing heterogeneous transfer between models with different architectures, attention mechanisms, and tokenizers.
*   **Three Modular Components**: Proposes three novel components to overcome specific challenges:
    1.  **TIDAL**: A dual-axis scheduler modulating distillation strength based on both **training progress** and **diffusion timestep** to account for the teacher's noise-dependent reliability.
    2.  **CompDemo**: Enriches teacher context via **complementary mask splitting**, providing better predictions under heavy masking.
    3.  **Reverse CALM**: A cross-tokenizer objective using **inverted chunk-level likelihood matching** to yield bounded gradients and dual-end noise filtering.
*   **Effective Distillation**: A 0.6B student distilled from 16B MoE and 8B dense teachers outperforms the non-distilled baseline by an average of **+1.53 points** across eight benchmarks and achieves a **+16.48 gain** on HumanEval code generation over a same-sized autoregressive model.
*   **Practical Efficiency**: The distilled 0.6B student requires **22x less memory** and runs **5x faster** than the 16B teacher, enabling deployment on commodity hardware with minimal inference overhead.

## Introduction and Theoretical Foundation
Diffusion Large Language Models (dLLMs) offer parallel decoding and bidirectional context as an alternative to dominant autoregressive (AR) models. However, state-of-the-art dLLMs require billions of parameters (e.g., 8B-100B), posing a significant deployment barrier. While knowledge distillation is well-established for AR models and existing dLLM distillation methods focus on *step compression* (reducing inference steps within the same architecture), **cross-architecture distillation** for dLLMs remains unexplored.

This setting introduces three fundamental challenges:
1.  **Temporal Inconsistency**: The teacher's reliability fluctuates drastically across the diffusion process (timestep-dependent).
2.  **Context Scarcity**: Severe masking at high noise levels reduces available context, making teacher predictions uninformative.
3.  **Vocabulary Misalignment**: Distinct tokenizer vocabularies render standard token-level likelihood objectives inapplicable.

**TIDE** is proposed as a unified framework that addresses these temporal, spatial, and vocabulary barriers with three complementary components combined in one end-to-end training pipeline.

## Methodology
The TIDE framework distills a large teacher dLLM $f_T$ (parameters $\theta_T$) into a smaller student $f_S$ (parameters $\theta_S$); the two may differ in architecture, attention mechanism, and tokenizer. Let $x = (x_1, ..., x_L)$ be a clean token sequence and $x_t$ be the noised version at diffusion timestep $t \in [\epsilon, 1)$, with the positions in the masked set $M$ replaced by `[MASK]`.

### 1. Time-Iteration Dual-Axis Lambda Modulation (TIDAL)
This component dynamically modulates distillation strength along two axes to handle temporal inconsistency.

*   **Axis 1: Diffusion Timestep**: Modulates based on teacher reliability at noise level $t$.
    $$ \lambda_t = \lambda_{train} \times (1 - t) $$
    At high noise ($t \approx 1$), $\lambda_t \approx 0$, avoiding unreliable teacher signals. At low noise ($t \approx 0$), $\lambda_t \approx \lambda_{train}$, fully relying on the teacher.
*   **Axis 2: Training Progress**: The base coefficient $\lambda_{train}$ follows a cosine schedule over normalized progress $p \in [0, 1]$.
    $$ \lambda_{train} = \lambda_{init} + (\lambda_{max} - \lambda_{init}) \times \frac{1}{2}(1 - \cos(\pi \cdot p)) $$
    Defaults: $\lambda_{init}=0.1$, $\lambda_{max}=0.9$. Early training is student-dominated to prevent collapse; later stages shift to teacher supervision.
*   **Interpolated Target and Loss**: Given student logits $s$ and teacher logits $t$ at masked positions (here the bare $t$ denotes teacher logits, while the subscript in $\lambda_t$ and $r_t$ refers to the diffusion timestep), the interpolated target $r_t$ is:
    $$ r_t = \text{softmax}\left( \frac{(1 - \lambda_t) \cdot s + \lambda_t \cdot t}{T} \right) $$
    The **TIDAL loss** is then:
    $$ \mathcal{L}_{\text{TIDAL}} = D_{KL}\left( r_t \parallel \text{softmax}\left( \frac{s}{T} \right) \right) \times T^2 $$
    $r_t$ is detached from the graph. An optional midrange timestep weighting $w(t) = \exp\left(-\frac{(t-0.5)^2}{2\sigma^2}\right)$ with $\sigma=0.15$ can be applied.
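
The schedule and loss above can be sketched in plain Python for a single masked position. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`lambda_train`, `lambda_t`, `tidal_loss`, `midrange_weight`) are hypothetical, logits are plain lists rather than tensors, and "detaching" $r_t$ is modeled simply by treating it as a constant.

```python
import math

def lambda_train(p, lam_init=0.1, lam_max=0.9):
    """Axis 2: cosine ramp of the base coefficient over progress p in [0, 1]."""
    return lam_init + (lam_max - lam_init) * 0.5 * (1.0 - math.cos(math.pi * p))

def lambda_t(p, t, lam_init=0.1, lam_max=0.9):
    """Axis 1 on top of Axis 2: scale the base coefficient by (1 - t)."""
    return lambda_train(p, lam_init, lam_max) * (1.0 - t)

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def tidal_loss(student_logits, teacher_logits, lam, temperature=1.0):
    """KL(r || softmax(s / T)) * T^2 for one position; r is a fixed target."""
    mixed = [(1.0 - lam) * s + lam * tch
             for s, tch in zip(student_logits, teacher_logits)]
    r = softmax(mixed, temperature)            # interpolated target
    q = softmax(student_logits, temperature)   # student distribution
    kl = sum(ri * math.log(ri / qi) for ri, qi in zip(r, q) if ri > 0.0)
    return kl * temperature ** 2

def midrange_weight(t, sigma=0.15):
    """Optional Gaussian weighting centred on t = 0.5."""
    return math.exp(-((t - 0.5) ** 2) / (2.0 * sigma ** 2))

# Example: halfway through training, at a fairly noisy timestep.
lam = lambda_t(p=0.5, t=0.8)
loss = tidal_loss([1.0, 2.0, 0.5], [0.0, 3.0, 1.0], lam, temperature=2.0)
print(f"lambda_t={lam:.3f}  loss={loss:.5f}  w(t)={midrange_weight(0.8):.3f}")
```

When $\lambda_t = 0$ the target coincides with the student distribution and the loss vanishes, which is the behaviour the schedule relies on at $t \approx 1$.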

### 2. Complementary Demonstration (CompDemo)
This component enriches teacher context to overcome spatial scarcity under heavy masking.

*   **Mask Splitting**: The masked set $M$ is randomly partitioned into two complementary subsets $M_A$ and $M_B$ such that:
    $$ M_A \cup M_B = M,\quad M_A \cap M_B = \emptyset,\quad |M_A| / |M| \approx \rho $$
    where the demonstration ratio $\rho = 0.5$.
*   **Two-Pass Teacher Inference**: Perform two forward passes through the frozen teacher:
    *   **Pass 1**: $t^{(1)} = f_T(\text{reveal } M_A, \text{ mask } M_B) \rightarrow$ logits at $M_B$
    *   **Pass 2**: $t^{(2)} = f_T(\text{reveal } M_B, \text{ mask } M_A) \rightarrow$ logits at $M_A$
    The merged final logits are: $t_{\text{final}}[M_B] \leftarrow t^{(1)}[M_B]$ and $t_{\text{final}}[M_A] \leftarrow t^{(2)}[M_A]$.
*   This strategy doubles teacher forward passes but increases total training time by only ~50% as the teacher is frozen.
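
The split-and-merge procedure above can be sketched as follows. This is an illustrative sketch only: `split_mask`, `compdemo_teacher_logits`, and the `teacher_fn` callable (which maps a token list to one logit entry per position) are hypothetical names, and "revealing" a subset is assumed to mean substituting the clean tokens at those positions.

```python
import random

def split_mask(masked_positions, rho=0.5, rng=None):
    """Randomly partition the masked set M into disjoint M_A and M_B."""
    if rng is None:
        rng = random.Random()
    shuffled = list(masked_positions)
    rng.shuffle(shuffled)
    k = round(rho * len(shuffled))  # |M_A| / |M| is approximately rho
    return set(shuffled[:k]), set(shuffled[k:])

def compdemo_teacher_logits(clean_tokens, masked_positions, teacher_fn,
                            mask_token="[MASK]", rho=0.5, rng=None):
    """Run two teacher passes with complementary masks and merge the logits."""
    m_a, m_b = split_mask(masked_positions, rho, rng)
    merged = {}
    # Pass 1 hides M_B (M_A revealed); pass 2 hides M_A (M_B revealed).
    for hidden in (m_b, m_a):
        inputs = [mask_token if i in hidden else tok
                  for i, tok in enumerate(clean_tokens)]
        logits = teacher_fn(inputs)
        for i in hidden:
            merged[i] = logits[i]  # keep logits only at still-masked positions
    return merged

# Toy teacher (assumption): reports how many [MASK] tokens it saw.
def toy_teacher(inputs):
    return [inputs.count("[MASK]")] * len(inputs)

merged = compdemo_teacher_logits(list("abcdefgh"), [1, 2, 5, 6], toy_teacher,
                                 rng=random.Random(0))
print(sorted(merged.items()))
```

With four masked positions and $\rho = 0.5$, each teacher pass sees only two `[MASK]` tokens instead of four, which is the extra context the component is meant to provide.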

### 3. Distillation Objectives & Reverse CALM
TIDE supports two pipelines with tailored objectives.

*   **Shared-Tokenizer Objective** (WeDLM → BD3LM): Teacher and student share a tokenizer. The loss combines Cross-Entropy (CE) and TIDAL:
    $$ \mathcal{L}_B = \mathcal{L}_{\text{CE}} + w_{\text{tidal}} \cdot \mathcal{L}_{\text{TIDAL}} $$
    CompDemo can be optionally integrated.

*   **Cross-Tokenizer Objective** (LLaDA2 → BD3LM): Teacher and student have different vocabularies ($V_T \neq V_S$). **Chunk-level Approximate Likelihood Matching (CALM)** is introduced, aligning token sequences at the byte level to identify *chunks*.
    *   **Chunk-Level Log-Probabilities**: For each token $x_i$, $\log P(x_i) = \text{logits}_{x_i} - \text{logsumexp}(\text{logits})$. Chunk-level log-probabilities are obtained via alignment matrices $A_S$ and $A_T$:
        $$ \text{LP}_S = \text{lp}_S \cdot A_S \in \mathbb{R}^{b \times C}, \quad \text{LP}_T = \text{lp}_T \cdot A_T \in \mathbb{R}^{b \times C} $$
        Chunk probabilities: $p^c_s = \exp(\text{LP}^c_S / T)$ and $p^c_t = \exp(\text{LP}^c_T / T)$.
    *   **Forward CALM (Baseline)**: Applies Binary Cross-Entropy (BCE):
        $$ \mathcal{L}_{\text{Fwd-CALM}} = -[p^c_t \log p^c_s + (1 - p^c_t) \log(1 - p^c_s)] $$
        Can be integrated with TIDAL: $p_{\text{mix}} = (1 - \lambda_t) \cdot p^c_s + \lambda_t \cdot p^c_t$.
    *   **Reverse CALM (Proposed)**: To address gradient explosion ($p^c_t / p^c_s$ divergence) in Forward CALM, the BCE direction is reversed:
        $$ \mathcal{L}_{\text{Rev-CALM}} = -[p^c_s \log p^c_t + (1 - p^c_s) \log(1 - p^c_t)] $$
        This yields bounded gradients (the coefficient $\log \frac{p^c_t}{1 - p^c_t}$ depends only on the fixed teacher) and provides **dual-end noise filtering**. The loss is closely tied to the Bernoulli KL divergence $KL_{\text{Bern}}(p^c_s \parallel p^c_t)$, which equals it minus the student's Bernoulli entropy.
    *   The cross-tokenizer training objective is:
        $$ \mathcal{L}_A = \mathcal{L}_{\text{CE}} + w_{\text{calm}} \cdot \mathcal{L}_{\text{dist}}, \quad \text{where } \mathcal{L}_{\text{dist}} \in \{\mathcal{L}_{\text{CALM-TIDAL}}, \mathcal{L}_{\text{Rev-CALM}}\} $$
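
To make the gradient argument concrete, the sketch below computes chunk-level log-probabilities and compares the derivative of each loss with respect to the student chunk probability $p^c_s$. It is a hedged illustration: all function names are hypothetical, and the alignment matrix is assumed to be a 0/1 token-to-chunk assignment, represented here as a list of chunk ids per token.

```python
import math

def log_softmax_at(logits, index):
    """log P(x_i) = logits[x_i] - logsumexp(logits)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - lse

def chunk_log_probs(token_log_probs, chunk_ids, num_chunks):
    """Sum token log-probs per chunk (lp @ A for a 0/1 alignment matrix A)."""
    out = [0.0] * num_chunks
    for lp, c in zip(token_log_probs, chunk_ids):
        out[c] += lp
    return out

def forward_calm(p_s, p_t):
    """BCE with the teacher probability as the target."""
    return -(p_t * math.log(p_s) + (1.0 - p_t) * math.log(1.0 - p_s))

def reverse_calm(p_s, p_t):
    """BCE with the roles of student and teacher swapped."""
    return -(p_s * math.log(p_t) + (1.0 - p_s) * math.log(1.0 - p_t))

def forward_grad(p_s, p_t):
    """d(forward_calm)/d(p_s): contains p_t / p_s, unbounded as p_s -> 0."""
    return -(p_t / p_s - (1.0 - p_t) / (1.0 - p_s))

def reverse_grad(p_s, p_t):
    """d(reverse_calm)/d(p_s): depends only on the fixed teacher value."""
    return -math.log(p_t / (1.0 - p_t))

# Three student tokens forming two chunks: tokens 0 and 1 -> chunk 0.
lp_chunks = chunk_log_probs([-0.1, -0.2, -0.3], [0, 0, 1], num_chunks=2)
p_s_chunks = [math.exp(lp) for lp in lp_chunks]  # temperature T = 1
print(p_s_chunks)

# Gradient magnitude when the student assigns a tiny probability to a chunk.
for p_s in (0.5, 1e-3, 1e-6):
    print(p_s, abs(forward_grad(p_s, 0.9)), abs(reverse_grad(p_s, 0.9)))
```

The forward gradient grows roughly like $p^c_t / p^c_s$ as the student probability shrinks, while the reverse gradient stays at the same teacher-determined value for every $p^c_s$.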

## Empirical Validation / Results
### Experimental Setup
*   **Student Model**: Qwen3-0.6B-BD3LM (0.6B block diffusion model).
*   **Teacher Models**:
    *   **Pipeline A (Cross-Tokenizer)**: LLaDA2.0-mini (16B MoE, independent tokenizer).
    *   **Pipeline B (Shared-Tokenizer)**: WeDLM-8B-Instruct (8B dense, shared Qwen tokenizer).
*   **Training**: 10 epochs, LR=5e-5, sequence length 512, on combined SFT datasets (Tulu-3, SmolTalk, OpenCoder).
*   **Evaluation**: 8 benchmarks: GSM8K, MATH, BBH, MMLU-Pro, HellaSwag, MMLU, HumanEval, MBPP.
*   **Baselines**: AR model (Qwen3-0.6B-Base) and non-distilled BD3LM from Zhou et al. (2026).

### Main Results

**Table 1: Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. *Bold*: best among dLLM models; *underline*: second best.**

| Benchmark | AR (Qwen3-0.6B) | BD3LM (No Distill) | Shared-Tokenizer: KL | Shared-Tokenizer: TIDE-Cross | Shared-Tokenizer: TIDE-Shared | Cross-Tokenizer: CALM | Cross-Tokenizer: TIDE-Shared | Cross-Tokenizer: TIDE-Cross |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| GSM8K | 59.60 | 45.56 | 43.97 | 45.03 | **48.98** | 48.60 | 49.89 | **52.24** |
| MATH | 32.40 | 13.08 | 9.40 | 9.76 | 11.16 | **13.14** | 12.98 | **13.20** |
| BBH | 41.50 | 26.32 | 25.79 | 26.00 | 26.79 | 24.21 | **26.85** | **27.37** |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 12.88 | **14.48** | 13.47 | 14.02 | **14.52** |
| HellaSwag | 47.40 | 39.28 | 39.78 | 39.50 | **40.50** | 40.42 | 39.57 | **39.88** |
| MMLU | 52.80 | 39.15 | 39.57 | 39.09 | **39.92** | 39.42 | 39.54 | **39.59** |
| HumanEval | 32.30 | 46.34 | 41.46 | 42.68 | **48.78** | 43.90 | **49.39** | 48.17 |
| MBPP | 36.60 | 37.80 | 31.20 | 31.40 | **37.80** | 34.80 | 38.40 | **38.60** |
| **Avg** | **40.91** | **32.67** | **30.55** | **30.79** | **33.55** | **32.25** | **33.83** | **34.20** |

*   **Cross-Architecture Distillation Is Effective**: Both TIDE pipelines outperform the non-distilled BD3LM baseline (Avg 32.67). The **cross-tokenizer TIDE-Cross** strategy achieves the highest average score (**34.20**), and **shared-tokenizer TIDE-Shared** reaches **33.55**.
*   **Each Pipeline Favors Its Native Strategy**:
    *   Cross-tokenizer pipeline prefers **TIDE-Cross** (Reverse CALM) by **+0.37 avg**, suited for alignment noise.
    *   Shared-tokenizer pipeline prefers **TIDE-Shared** (TIDAL + CompDemo) by **+2.76 avg**, effective with exact token alignment.
*   **Distilled dLLMs Excel at Code Generation**: On HumanEval, TIDE-Shared (shared) scores **48.78** and TIDE-Cross (cross) scores **48.17**, substantially exceeding the AR baseline (**32.30**). This suggests parallel diffusion decoding benefits structured output generation.
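
The headline deltas quoted above follow directly from the Table 1 averages. The short check below recomputes them; the variable names are illustrative, and each pipeline's runner-up is taken to be the other TIDE strategy in that pipeline, as the bullets describe.

```python
# Averages and HumanEval scores copied from Table 1.
avg_no_distill = 32.67
shared = {"TIDE-Shared": 33.55, "TIDE-Cross": 30.79}  # shared-tokenizer pipeline
cross = {"TIDE-Cross": 34.20, "TIDE-Shared": 33.83}   # cross-tokenizer pipeline
humaneval_ar = 32.30
humaneval_shared_tide = 48.78

def delta(a, b):
    """Difference rounded to two decimals, matching the table precision."""
    return round(a - b, 2)

gain_over_baseline = delta(cross["TIDE-Cross"], avg_no_distill)
cross_native_margin = delta(cross["TIDE-Cross"], cross["TIDE-Shared"])
shared_native_margin = delta(shared["TIDE-Shared"], shared["TIDE-Cross"])
humaneval_gain = delta(humaneval_shared_tide, humaneval_ar)

print(gain_over_baseline, cross_native_margin, shared_native_margin, humaneval_gain)
```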

### Ablation Studies
Ablations on the shared-tokenizer pipeline (TIDE-Shared strategy) isolate component contributions.

**Table 2: Component-level ablation on the shared-tokenizer pipeline (WeDLM → Qwen3-BD3LM). *Bold*: best per row.**

| Benchmark | Baseline (w/o Train) | w/o Tstep | w/o CompDemo | **Full** (TIDAL + CompDemo) |
| :--- | :---: | :---: | :---: | :---: |
| GSM8K | 48.07 | 48.82 | **48.90** | **48.90** |
| MATH | 11.74 | **11.96** | 11.84 | 11.68 |
| BBH | 26.37 | 26.51 | **26.77** | 26.66 |
| MMLU-Pro | 14.12 | **14.42** | 13.76 | 13.76 |
| HellaSwag | 40.03 | **40.35** | 40.16 | 40.27 |
| MMLU | 39.81 | 39.84 | 39.58 | **39.92** |
| HumanEval | 45.73 | 43.90 | 44.51 | **46.95** |
| MBPP | **38.60** | 37.20 | 38.20 | 37.00 |
| **Avg** | **33.06** | **32.88** | **32.97** | **33.14** |

*   **Timestep Axis Is Most Impactful**: Removing it causes the largest avg drop (**-0.26**), with a **-3.05** drop on HumanEval, validating its necessity.
*   **CompDemo Provides Consistent Gains**: Removing it reduces avg by **-0.17**, with notable drops on HumanEval (**-2.44**) and MMLU (**-0.34**).
*   **Full Framework Outperforms Baseline**: The complete TIDE (dual-axis + CompDemo) outperforms the timestep-only baseline on average (33.14 vs. 33.06), stabilizing early training.

### Inference Efficiency

**Table 3: Inference efficiency comparison (controlled setting). Peak memory, latency, and throughput are measured on a single H100-80GB GPU generating 256 tokens in bfloat16.**

| Model | Params (B) | Peak Mem (GB) | Latency (s) | Tokens/s |
| :--- | :---: | :---: | :---: | :---: |
| **Student (BD3LM-0.6B)** | | | | |
| Distilled | 0.60 | **1.4** | 6.25 | 41.0 |
| No Distill | 0.60 | **1.4** | 6.08 | 42.1 |
| **AR Baseline** | | | | |
| Qwen3-0.6B-Base | 0.60 | 1.2 | 4.99 | **51.3** |
| **Teachers** | | | | |
| WeDLM-8B-Instruct | 8.19 | 15.5 | 6.79 | 37.7 |
| LLaDA2.0-mini | 16.26 | **31.3** | **32.55** | **7.8** |

*   **Distillation Enables Practical Deployment**: The distilled student requires only **1.4 GB** peak memory (**22x reduction** vs. LLaDA2's 31.3 GB) and is **5.2x faster** (6.25s vs. 32.55s).
*   **Distillation Adds Minimal Overhead**: Compared to the undistilled BD3LM, distillation introduces only a **2.6%** throughput reduction (41.0 vs. 42.1 tokens/s) and leaves the memory footprint unchanged. The iterative diffusion process makes …
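
The efficiency ratios quoted in these bullets can be reproduced from the Table 3 measurements. The snippet below is a simple arithmetic check; the variable names are illustrative.

```python
# Measurements copied from Table 3 (single H100-80GB, 256 tokens, bfloat16).
student_mem_gb, teacher_mem_gb = 1.4, 31.3         # distilled student vs LLaDA2.0-mini
student_latency_s, teacher_latency_s = 6.25, 32.55
distilled_tps, undistilled_tps = 41.0, 42.1        # tokens per second

memory_reduction = teacher_mem_gb / student_mem_gb
speedup = teacher_latency_s / student_latency_s
throughput_drop_pct = 100.0 * (undistilled_tps - distilled_tps) / undistilled_tps

print(f"memory reduction: {memory_reduction:.1f}x")
print(f"speedup:          {speedup:.1f}x")
print(f"throughput drop:  {throughput_drop_pct:.1f}%")
```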
