Summary (Overview)

  • iLLaDA is an 8B masked diffusion language model trained from scratch with fully bidirectional attention, scaling pre‑training to 12T tokens and SFT to 25B tokens for 12 epochs.
  • Compared with the previous diffusion model LLaDA, iLLaDA achieves large improvements across general, mathematical, and code benchmarks (e.g., +21.6 points on BBH, +14.9 on ARC‑Challenge for the base model; +14.5 on MATH, +16.5 on HumanEval for the instruct model).
  • iLLaDA adopts grouped‑query attention, tied input/output embeddings, variable‑length generation, and a confidence‑based scoring rule for multiple‑choice evaluation.
  • Despite its non‑autoregressive formulation, iLLaDA‑Base is competitive with Qwen2.5 7B on several benchmarks (best results on MMLU, BBH, ARC‑C, GSM8K), while iLLaDA‑Instruct narrows the gap with strong autoregressive instruct models.

Introduction and Theoretical Foundation

Modern large language models are predominantly trained with autoregressive factorization and causal attention. Diffusion language models, particularly those using the masked diffusion formulation, offer a fundamentally different approach by employing fully bidirectional attention. Prior work (LLaDA) demonstrated that such models can acquire core LLM capabilities (in‑context learning, instruction following), and diffusion language models have also shown advantages in bidirectional reasoning, long‑horizon planning, and multimodal modeling. However, LLaDA’s performance remained behind strong autoregressive models like Qwen2 and Qwen2.5, leaving substantial room for improvement.

The theoretical basis of iLLaDA is the masked diffusion objective for discrete data. It is a likelihood‑based loss: tokens are masked with a random ratio $t \sim U[0,1]$, and the model is trained to predict only the masked tokens:

$$\mathcal{L}(\theta) \triangleq -\mathbb{E}_{t, x_0, x_t}\left[\frac{1}{t}\sum_{i=1}^L \mathbf{1}[x_i^t = M] \log p_\theta(x_i^0 \mid x_t)\right]. \tag{1}$$

This differs from fixed‑ratio masked language modeling; the loss is computed only over tokens that have been replaced with the special mask token $M$.
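As a rough illustration of Eq. (1), the sketch below computes a single‑sample estimate of the loss in plain Python. It assumes each token is masked independently with probability t (one common way to realize a "random ratio"), and the names `MASK`, `forward_mask`, and `log_prob` are hypothetical placeholders, not part of the paper.

```python
import math
import random

MASK = -1  # hypothetical id standing in for the mask token M


def forward_mask(x0, t, rng):
    """Assumed forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in x0]


def masked_diffusion_loss(x0, xt, t, log_prob):
    """Single-sample estimate of Eq. (1).

    log_prob(i, xt) is a hypothetical callable returning
    log p_theta(x0[i] | xt) for position i.
    """
    total = 0.0
    for i, noisy in enumerate(xt):
        if noisy == MASK:          # indicator 1[x_i^t = M]
            total += log_prob(i, xt)
    return -total / t              # 1/t weighting from Eq. (1)


# Toy usage: a "model" that is uniform over a 10-token vocabulary.
rng = random.Random(0)
x0 = [3, 1, 4, 1, 5, 9, 2, 6]
t = rng.uniform(1e-3, 1.0)         # t ~ U[0,1], kept away from 0 to avoid dividing by zero
xt = forward_mask(x0, t, rng)
loss = masked_diffusion_loss(x0, xt, t, lambda i, xt: math.log(1 / 10))
```

Unmasked positions contribute nothing to the sum, which is the sense in which the loss covers only masked tokens.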

Methodology

Pre‑training

iLLaDA retains the same masked diffusion objective as LLaDA (Eq. 1). Architectural changes include:

  • Grouped‑query attention (GQA) instead of multi‑head attention, reducing memory footprint when caching key/value states.
  • Tied input embedding and LM‑head parameters, controlling parameter count.
  • Random‑length training (30% probability of splitting an 8192‑token sequence into two shorter segments) and flash‑attention‑based variable‑length kernels to avoid padding.
  • Learning rate schedule: linear warmup to $2\times10^{-4}$, held constant until loss plateaus, then cosine decay to $5\times10^{-6}$.
  • AdamW optimizer with weight decay 0.1.
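The learning rate schedule in the list above can be sketched as a warmup, constant, cosine‑decay function. The step boundaries (`warmup_steps`, `stable_end`, `total_steps`) are hypothetical: the summary only says the rate is held until the loss plateaus, so in practice the decay start would be chosen by monitoring training rather than fixed in advance.

```python
import math


def lr_at(step, warmup_steps, stable_end, total_steps,
          peak=2e-4, final=5e-6):
    """Warmup -> constant -> cosine decay (boundaries are assumed, not from the paper)."""
    if step < warmup_steps:
        # Linear warmup from peak/warmup_steps up to peak.
        return peak * (step + 1) / warmup_steps
    if step < stable_end:
        # Constant phase at the peak learning rate.
        return peak
    # Cosine decay from peak down to final over the remaining steps.
    progress = (step - stable_end) / max(1, total_steps - stable_end)
    progress = min(1.0, progress)
    return final + 0.5 * (peak - final) * (1 + math.cos(math.pi * progress))


# Example with made-up boundaries: 100 warmup steps, decay from step 1000 to 2000.
schedule = [lr_at(s, 100, 1000, 2000) for s in (0, 99, 500, 1000, 1500, 2000)]
```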

Architecture details are given in Table 1.

Table 1 Architecture comparison between iLLaDA and LLaDA.

| Property | iLLaDA 8B | LLaDA 8B |
| --- | --- | --- |
| Layers | 32 | 32 |
| Model dimension | 4096 | 4096 |
| Attention heads | 32 | 32 |
| Key/Value heads | 8 | 32 |
| FFN dimension | 14,336 | 12,288 |
| Vocabulary size | 155,136 | 126,464 |
| Maximum sequence length | 8192 | 4096 |
| Embedding and LM‑head | Tied | Untied |
| Total parameters | 7.62B | 8.02B |
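The totals in Table 1 can be sanity‑checked with a back‑of‑the‑envelope count. The sketch below assumes a gated (three‑matrix) FFN, a head dimension equal to model dimension divided by attention heads, and no biases or normalization weights; none of these details are stated in the summary, so treat the result as an approximation.

```python
def approx_params(layers, d_model, n_heads, n_kv_heads, d_ffn, vocab, tied):
    """Approximate parameter count (assumes gated FFN, no biases, norms ignored)."""
    head_dim = d_model // n_heads
    # Query and output projections are d_model x d_model;
    # key and value projections shrink with the number of KV heads.
    attn = 2 * d_model * d_model + 2 * d_model * n_kv_heads * head_dim
    ffn = 3 * d_model * d_ffn                  # gate, up, and down projections
    embed = vocab * d_model * (1 if tied else 2)
    return layers * (attn + ffn) + embed


def kv_values_per_token(layers, d_model, n_heads, n_kv_heads):
    """Number of cached key/value scalars per token across all layers."""
    head_dim = d_model // n_heads
    return 2 * layers * n_kv_heads * head_dim


illada = approx_params(32, 4096, 32, 8, 14336, 155136, tied=True)
llada = approx_params(32, 4096, 32, 32, 12288, 126464, tied=False)
kv_ratio = kv_values_per_token(32, 4096, 32, 32) / kv_values_per_token(32, 4096, 32, 8)
```

Under these assumptions both estimates land within about 0.01B of the totals in Table 1, and moving from 32 to 8 key/value heads cuts the cached key/value state per token by a factor of 4. The estimate also shows where the parameters moved: the smaller key/value projections and tied embeddings offset the wider FFN and larger vocabulary.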

Supervised Fine‑Tuning (SFT)

iLLaDA uses the same data processing and masking scheme as in pre‑training. Each instruction example is formatted as a prompt‑response sequence with a terminal |EOS| token, and all formatted examples are concatenated into a continuous corpus. From this corpus, 8192‑token training sequences are sampled, and random masks are applied to the entire sequence (including prompt tokens). This differs from prior work that only masked the response region. The SFT corpus contains ≈25B tokens, and the model is fine‑tuned on it for 12 epochs. The learning rate is warmed up to $5\times10^{-6}$, held constant, then linearly decayed to $5\times10^{-7}$.
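A minimal sketch of this packing and full‑sequence masking is shown below. It is an illustration under assumptions: sequences are cut as contiguous, non‑overlapping windows (the summary only says they are "sampled"), tokens are masked independently with probability t, and `MASK`, `pack_examples`, and `mask_sequence` are hypothetical names.

```python
import random

MASK = "<MASK>"   # hypothetical mask token
EOS = "|EOS|"     # terminal token appended to every example


def pack_examples(examples, seq_len):
    """Concatenate (prompt, response) pairs into one stream, then cut fixed-length windows."""
    stream = []
    for prompt, response in examples:
        stream.extend(prompt + response + [EOS])
    # Assumed: contiguous, non-overlapping windows; any trailing remainder is dropped.
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]


def mask_sequence(seq, t, rng):
    """Mask every position with probability t, prompt tokens included."""
    return [MASK if rng.random() < t else tok for tok in seq]


examples = [(["q1", "a"], ["r1"]), (["q2"], ["r2", "b"])]
sequences = pack_examples(examples, seq_len=4)
noisy = mask_sequence(sequences[0], t=0.5, rng=random.Random(0))
```

Because masking ignores the prompt/response boundary, a window may start or end in the middle of an example, and prompt tokens can become prediction targets.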

Inference

For open‑ended generation, iLLaDA uses variable‑length generation: a block of mask tokens is appended to the prompt, and the diffusion sampler iteratively predicts masked positions, transferring the most confident predictions to visible tokens while remasking low‑confidence ones. New blocks are added until a stop token or maximum length is reached.
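The block‑wise procedure can be sketched as follows. This is a simplified illustration, not the paper's sampler: `predict` is a hypothetical callable that returns a (token, confidence) pair for every position, a fixed number of positions is committed per step, and positions that are not committed simply stay masked until a later step.

```python
MASK = None  # hypothetical placeholder for a masked position


def generate(prompt, predict, block_len=4, steps_per_block=2,
             max_new=16, stop="<eos>"):
    """Append mask blocks and fill each one by committing the most confident predictions."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        start = len(tokens)
        size = min(block_len, max_new - (start - len(prompt)))
        tokens.extend([MASK] * size)
        per_step = -(-size // steps_per_block)       # ceiling division
        while MASK in tokens[start:]:
            preds = predict(tokens)                   # [(token, confidence), ...]
            masked = [i for i in range(start, len(tokens)) if tokens[i] is MASK]
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            for i in masked[:per_step]:               # commit the most confident positions
                tokens[i] = preds[i][0]
            # Lower-confidence positions remain masked and are re-predicted next step.
        if stop in tokens[start:]:
            cut = tokens.index(stop, start)
            return tokens[len(prompt):cut]
    return tokens[len(prompt):]


def toy_predict(tokens):
    """Toy stand-in for the model: fixed answer after a 2-token prompt, earlier = more confident."""
    answer = ["the", "cat", "sat", "<eos>"]
    out = []
    for i in range(len(tokens)):
        j = i - 2
        tok = answer[j] if 0 <= j < len(answer) else "<eos>"
        out.append((tok, 1.0 / (1 + i)))
    return out


result = generate(["Q", ":"], toy_predict, block_len=2, steps_per_block=2)
```

With the toy predictor, two blocks of two tokens are filled before the stop token appears, so the output length is decided at generation time rather than fixed up front.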

For multiple‑choice evaluation, a confidence‑based scoring rule is used:

$$i_k = \arg\max_{i \in \mathcal{M}_{k-1}} p_\theta(y_i \mid p, \tilde{y}^{k-1}), \qquad S_{\text{conf}}(y \mid p) = \sum_{k=1}^L \log p_\theta(y_{i_k} \mid p, \tilde{y}^{k-1}), \tag{2}$$

where $\mathcal{M}_{k-1}$ is the set of positions still masked after step $k-1$, and $\tilde{y}^{k-1}$ contains the ground‑truth tokens already revealed and masks elsewhere. This score is a task‑specific surrogate, not a likelihood estimate.
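Read procedurally, Eq. (2) reveals one ground‑truth token per step, always choosing the masked position the model is currently most confident about, and sums the log‑probabilities at the moment of reveal. The sketch below follows that reading; `gt_probs` is a hypothetical callable standing in for the model, and `pick_answer` is an assumed way of using the score to choose among options.

```python
import math


def confidence_score(prompt, candidate, gt_probs):
    """Score a candidate answer following Eq. (2).

    gt_probs(prompt, partial, candidate) is a hypothetical callable returning,
    for every position i, the probability assigned to candidate[i] given the
    prompt and the partially revealed answer.
    """
    partial = [None] * len(candidate)           # None marks a masked position
    score, order = 0.0, []
    for _ in range(len(candidate)):
        probs = gt_probs(prompt, partial, candidate)
        masked = [i for i, tok in enumerate(partial) if tok is None]
        i_k = max(masked, key=lambda i: probs[i])   # most confident masked position
        score += math.log(probs[i_k])
        partial[i_k] = candidate[i_k]               # reveal the ground-truth token
        order.append(i_k)
    return score, order


def pick_answer(prompt, candidates, gt_probs):
    """Choose the candidate with the highest confidence-based score."""
    return max(candidates, key=lambda c: confidence_score(prompt, c, gt_probs)[0])


def toy_probs(prompt, partial, candidate):
    """Toy model: a fixed, context-independent probability per token."""
    table = {"yes": 0.5, "no": 0.25, "maybe": 0.8}
    return [table[tok] for tok in candidate]


score, order = confidence_score(["q"], ["yes", "no", "maybe"], toy_probs)
best = pick_answer(["q"], [["no"], ["maybe"], ["yes"]], toy_probs)
```

In the toy case the probabilities do not depend on context, so the reveal order changes nothing; with a real model each reveal changes the conditioning for the remaining positions, which is why the order matters and why the score is a surrogate rather than a likelihood.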

Empirical Validation / Results

Base Model Results (Table 2)

iLLaDA 8B is compared with LLaDA 8B, Dream 7B (a diffusion model fine‑tuned from Qwen2.5), and Qwen2.5 7B (autoregressive). Key gains over LLaDA: +21.6 on BBH, +14.9 on ARC‑C, +11.6 on GSM8K, +14.6 on HumanEval. iLLaDA achieves the highest average (63.9) among the four models reported, ahead of Qwen2.5 7B (63.3).

Table 2 Benchmark results of base models.

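| Task | iLLaDA 8B | LLaDA 8B | Dream 7B | Qwen2.5 7B |
| --- | --- | --- | --- | --- |
| MMLU | 74.8 | 65.9 | 69.5 | 71.9 |
| BBH | 71.3 | 49.7 | 57.9 | 63.9 |
| ARC‑C | 60.8 | 45.9 | 59.8 | 51.5 |
| HellaSwag | 76.6 | 70.5 | 73.3 | 79.0 |
| GSM8K | 81.9 | 70.3 | 77.2 | 78.9 |
| MATH | 38.4 | 31.4 | 39.6 | 41.1 |
| HumanEval | 50.0 | 35.4 | 57.9 | 56.7 |
| MBPP | 57.8 | 40.0 | 56.2 | 63.6 |
| Avg. | 63.9 | 51.1 | 61.4 | 63.3 |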

Instruct Model Results (Table 3)

iLLaDA‑Instruct improves over LLaDA‑Instruct by large margins (e.g., +11.5 on GSM8K, +14.5 on MATH, +16.5 on HumanEval) and outperforms Dream 7B on every task except MBPP. It still trails Qwen2.5 7B Instruct, most visibly on math and code benchmarks (MATH, HumanEval, MBPP), but achieves the best result on MMLU‑Redux.

Table 3 Benchmark results of instruct models.

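| Task | iLLaDA 8B | LLaDA 8B | Dream 7B | Qwen2.5 7B |
| --- | --- | --- | --- | --- |
| MMLU | 71.6 | 65.5 | 67.0 | 76.6 |
| MMLU‑Pro | 52.3 | 37.0 | 43.3 | 56.3 |
| MMLU‑Redux | 76.4 | 68.9 | 76.3 | 75.7 |
| GSM8K | 89.0 | 77.5 | 81.0 | 91.6 |
| MATH | 56.7 | 42.2 | 39.2 | 75.5 |
| HumanEval | 65.9 | 49.4 | 55.5 | 84.8 |
| MBPP | 58.0 | 41.0 | 58.8 | 79.2 |
| Avg. | 67.1 | 54.5 | 60.2 | 77.1 |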

Ablation Studies

Multiple‑choice scoring (Table 4). Confidence‑based scoring consistently outperforms likelihood‑style scoring across three tasks: gains of 1.3 on PIQA, 0.6 on ARC‑C, and 2.3 on HellaSwag.

Table 4 Ablation of multiple‑choice scoring rules.

| Scoring rule | PIQA | ARC‑C | HellaSwag |
| --- | --- | --- | --- |
| Likelihood | 77.2 | 60.2 | 74.3 |
| Confidence | 78.5 | 60.8 | 76.6 |

SFT epoch duration (Figure 1). Performance on GSM8K, MATH, and MMLU‑Pro continues to improve up to 12 epochs, supporting the use of long SFT. This aligns with findings that diffusion models benefit from repeated data exposure.

Theoretical and Practical Implications

The results demonstrate that fully bidirectional diffusion training from scratch is a competitive path toward strong language models, challenging the prevailing assumption that autoregressive modeling is necessary for high‑performance LLMs. iLLaDA’s success shows that diffusion language models can scale to 12T tokens and achieve performance on par with strong autoregressive baselines like Qwen2.5 7B on several benchmarks.

Practical implications:

  • Architectural choices (GQA, tied embeddings, variable‑length training) are important for scaling diffusion LMs efficiently.
  • SFT with the same masked diffusion objective, applied over multiple epochs, is beneficial and does not lead to overfitting in the studied regime.
  • Confidence‑based scoring is a simple and effective method for multiple‑choice evaluation with diffusion models.

Limitations: iLLaDA has not been further aligned with reinforcement learning, which likely explains the remaining gap in the instruct setting. The study is limited to the 8B scale; larger scales may yield further gains.

Conclusion

iLLaDA, an 8B fully bidirectional masked diffusion language model trained from scratch on 12T tokens, substantially improves over prior diffusion language models (LLaDA, Dream) across general, mathematical, and code benchmarks. Its base model is competitive with Qwen2.5 7B, narrowing the gap between diffusion LMs and strong autoregressive models. The paper demonstrates that careful scaling of pre‑training and SFT, along with practical design choices like confidence‑based scoring and variable‑length generation, can yield strong performance from a non‑autoregressive architecture. Future work should explore reinforcement learning alignment (e.g., VRPO, diffu‑GRPO, MDPO, ESPO) and larger‑scale studies.
