Visual Summary | Improved Large Language Diffusion Models

Summary (Overview)

iLLaDA is an 8B masked diffusion language model trained from scratch with fully bidirectional attention. It is pre‑trained on 12T tokens and then fine‑tuned (SFT) on 25B tokens for 12 epochs.
Compared with the previous diffusion model LLaDA, iLLaDA achieves large improvements across general, mathematical, and code benchmarks (e.g., +21.6 points on BBH, +14.9 on ARC‑Challenge for the base model; +14.5 on MATH, +16.5 on HumanEval for the instruct model).
iLLaDA adopts grouped‑query attention, tied input/output embeddings, variable‑length generation, and a confidence‑based scoring rule for multiple‑choice evaluation.
Despite its non‑autoregressive formulation, iLLaDA‑Base is competitive with Qwen2.5 7B on several benchmarks (best results on MMLU, BBH, ARC‑C, GSM8K), while iLLaDA‑Instruct narrows the gap with strong autoregressive instruct models.

Introduction and Theoretical Foundation

Modern large language models are predominantly trained with autoregressive factorization and causal attention. Diffusion language models, particularly those using the masked diffusion formulation, offer a fundamentally different approach by employing fully bidirectional attention. Prior work (LLaDA) demonstrated that such models can acquire core LLM capabilities (in‑context learning, instruction following) and have shown advantages in bidirectional reasoning, long‑horizon planning, and multimodal modeling. However, LLaDA’s performance remained behind strong autoregressive models like Qwen2 and Qwen2.5, leaving substantial room for improvement.

The theoretical basis of iLLaDA is the masked diffusion objective for discrete data, a likelihood‑based loss: a masking ratio $t \sim U[0,1]$ is sampled, tokens of the clean sequence $x_0$ are masked at that ratio to give $x_t$, and the model is trained to predict only the masked tokens:

$$\mathcal{L}(\theta) \triangleq -\mathbb{E}_{t, x_0, x_t}\left[\frac{1}{t}\sum_{i=1}^L \mathbf{1}[x_t^i = M] \log p_\theta(x_0^i \mid x_t)\right]. \tag{1}$$

This differs from fixed‑ratio masked language modeling in that the masking ratio $t$ is random rather than fixed. The loss is computed only over tokens that have been replaced with the special mask token $M$.
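To make Eq. (1) concrete, the following is a minimal single‑sample sketch of the loss for one sequence. It is an illustration, not the paper's training code: the mask id `MASK`, the helper names, and the uniform toy model `uniform_log_prob` (standing in for $p_\theta$ with a hypothetical 100‑token vocabulary) are all assumptions. The ratio is drawn from $(0, 1]$ instead of $[0, 1]$ only to avoid dividing by zero in the toy.

```python
import math
import random

MASK = -1  # hypothetical id for the special mask token M


def forward_mask(x0, t, rng):
    """Mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in x0]


def masked_diffusion_loss(x0, xt, t, log_prob_fn):
    """Single-sample estimate of Eq. (1): -(1/t) * sum of log p over masked positions."""
    total = 0.0
    for i, tok in enumerate(xt):
        if tok == MASK:  # unmasked positions contribute nothing
            total += log_prob_fn(xt, i, x0[i])
    return -total / t


def uniform_log_prob(xt, i, target, vocab_size=100):
    """Toy stand-in for p_theta: uniform over a hypothetical 100-token vocabulary."""
    return -math.log(vocab_size)


rng = random.Random(0)
x0 = [5, 17, 42, 8, 23, 61]
t = 1.0 - rng.random()  # uniform on (0, 1], so the 1/t weight is always finite
xt = forward_mask(x0, t, rng)
loss = masked_diffusion_loss(x0, xt, t, uniform_log_prob)
print(f"t={t:.3f} masked={xt.count(MASK)} loss={loss:.3f}")
```

With a uniform model the loss reduces to (number of masked tokens / t) times log 100, which makes the role of the $1/t$ weight easy to see: sequences with few masked tokens are up‑weighted.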

Methodology

Pre‑training

iLLaDA retains the same masked diffusion objective as LLaDA (Eq. 1). Architectural and training changes include:

Grouped‑query attention (GQA) instead of multi‑head attention, reducing memory footprint when caching key/value states.
Tied input embedding and LM‑head parameters, controlling parameter count.
Random‑length training (30% probability of splitting an 8192‑token sequence into two shorter segments) and flash‑attention‑based variable‑length kernels to avoid padding.
Learning rate schedule: linear warmup to $2\times10^{-4}$, held constant until the loss plateaus, then cosine decay to $5\times10^{-6}$.
AdamW optimizer with weight decay 0.1.
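The learning rate schedule in the list above can be sketched as a small function. Only the peak ($2\times10^{-4}$) and final ($5\times10^{-6}$) values come from the text. The step boundaries are hypothetical inputs, and in particular `decay_start` stands in for the point where the loss plateaus, which the summary does not quantify.

```python
import math

PEAK_LR = 2e-4   # from the text
FINAL_LR = 5e-6  # from the text


def lr_at(step, warmup_steps, decay_start, total_steps):
    """Linear warmup -> constant -> cosine decay. All step boundaries are hypothetical."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    if step < decay_start:
        return PEAK_LR
    progress = min(1.0, (step - decay_start) / (total_steps - decay_start))
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))


# Illustrative boundaries only: 2k warmup steps, decay from step 80k to step 100k.
for s in (0, 1000, 2000, 50_000, 80_000, 90_000, 100_000):
    print(s, f"{lr_at(s, 2000, 80_000, 100_000):.2e}")
```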

Architecture details are given in Table 1.

Table 1 Architecture comparison between iLLaDA and LLaDA.

Property	iLLaDA 8B	LLaDA 8B
Layers	32	32
Model dimension	4096	4096
Attention heads	32	32
Key/Value heads	8	32
FFN dimension	14,336	12,288
Vocabulary size	155,136	126,464
Maximum sequence length	8192	4096
Embedding and LM‑head	Tied	Untied
Total parameters	7.62B	8.02B
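As a rough sanity check on Table 1, the parameter totals can be approximated from the listed dimensions. This sketch assumes a gated FFN with three projection matrices and no bias terms, and it ignores normalization weights. None of these details are stated in the summary, so the result is an estimate, not the authors' accounting.

```python
def transformer_params(layers, d_model, n_heads, n_kv_heads, d_ffn, vocab, tied):
    """Approximate weight count; assumes a gated FFN, no biases, and ignores norm weights."""
    head_dim = d_model // n_heads
    # Q and O are d_model x d_model; K and V shrink with the number of KV heads (GQA).
    attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
    ffn = 3 * d_model * d_ffn  # assumed gate, up, and down projections
    embed = vocab * d_model * (1 if tied else 2)  # tied: one matrix for input and LM head
    return layers * (attn + ffn) + embed


illada = transformer_params(32, 4096, 32, 8, 14336, 155136, tied=True)
llada = transformer_params(32, 4096, 32, 32, 12288, 126464, tied=False)
print(f"iLLaDA ~{illada / 1e9:.2f}B, LLaDA ~{llada / 1e9:.2f}B")
print("KV state per token, iLLaDA relative to LLaDA:", 8 / 32)
```

Under these assumptions the estimate comes to about 7.61B for iLLaDA and 8.02B for LLaDA, close to the totals in Table 1. The small remaining gap for iLLaDA presumably comes from terms this sketch omits. With the same head dimension and layer count, 8 key/value heads instead of 32 also means a quarter of the key/value state per token.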

Supervised Fine‑Tuning (SFT)

iLLaDA uses the same data processing and masking scheme as pre‑training. Each instruction example is formatted as a prompt‑response sequence with a terminal |EOS| token, and all formatted examples are concatenated into a continuous corpus. Training sequences of 8192 tokens are sampled from this corpus, and random masks are applied to the entire sequence (including prompt tokens). This differs from prior work that only masked the response region. The SFT corpus contains ≈25B tokens, and the model is fine‑tuned on it for 12 epochs. The learning rate is warmed up to $5\times10^{-6}$, held constant, then linearly decayed to $5\times10^{-7}$.
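The packing and masking steps described above can be sketched as follows. This is a toy under stated assumptions: the token ids `EOS` and `MASK` are hypothetical, the sequence length is 8 in place of 8192, and sequences are cut consecutively from the stream, whereas the summary only says they are sampled from the corpus.

```python
import random

EOS, MASK = 0, -1  # hypothetical token ids


def pack(examples, seq_len):
    """Concatenate prompt + response + EOS for every example, then cut fixed-length sequences."""
    stream = []
    for prompt, response in examples:
        stream.extend(prompt + response + [EOS])
    n_full = len(stream) // seq_len  # this toy drops the trailing partial chunk
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def mask_all(seq, t, rng):
    """Mask every position, prompt and response alike, with probability t."""
    return [MASK if rng.random() < t else tok for tok in seq]


examples = [([11, 12], [21, 22, 23]), ([13], [24, 25]), ([14, 15, 16], [26])]
rng = random.Random(0)
for seq in pack(examples, seq_len=8):  # 8 stands in for 8192
    t = 1.0 - rng.random()  # masking ratio in (0, 1]
    print(seq, "->", mask_all(seq, t, rng))
```

Note that a packed sequence can start or end in the middle of an example, so the model sees prompt and response tokens masked under exactly the same rule as in pre‑training.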

Inference

For open‑ended generation, iLLaDA uses variable‑length generation: a block of mask tokens is appended to the prompt, and the diffusion sampler iteratively predicts masked positions, transferring the most confident predictions to visible tokens while remasking low‑confidence ones. New blocks are added until a stop token or maximum length is reached.
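A minimal sketch of this block‑wise loop is shown below. It is not the authors' sampler: `generate`, `toy_predict`, the ids `MASK` and `EOS`, the block size, and the number of tokens committed per step are all hypothetical. Remasking is modeled by simply leaving low‑confidence positions masked for the next step, and a hard‑coded toy predictor replaces the real model.

```python
MASK, EOS = -1, 0  # hypothetical token ids


def generate(prompt, predict_fn, block_len=4, per_step=2, max_new=16):
    """Append mask blocks and fill each one by committing the most confident predictions."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        start = len(seq)
        seq.extend([MASK] * block_len)  # append a fresh block of mask tokens
        while MASK in seq[start:]:
            masked = [i for i in range(start, len(seq)) if seq[i] == MASK]
            preds = {i: predict_fn(seq, i) for i in masked}  # position -> (token, confidence)
            keep = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:per_step]
            for i in keep:  # commit the most confident; the rest stay masked
                seq[i] = preds[i][0]
        block = seq[start:]
        if EOS in block:  # stop token reached: cut everything after it
            return seq[:start + block.index(EOS) + 1]
    return seq


TARGET = [7, 8, 9, 7, 8, 0, 5, 5]  # hypothetical continuation; 0 is the stop token


def toy_predict(seq, i, prompt_len=2):
    """Toy predictor: returns a fixed target token, more confident at earlier positions."""
    k = i - prompt_len
    token = TARGET[k] if k < len(TARGET) else EOS
    return token, 1.0 / (1 + k)


print(generate([3, 4], toy_predict))  # [3, 4, 7, 8, 9, 7, 8, 0]
```

In the toy run, the first block fills without a stop token, so a second block is appended. The stop token appears in the second block and the output is truncated there.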

For multiple‑choice evaluation, a confidence‑based scoring rule is used:

$$i_k = \arg\max_{i \in \mathcal{M}_{k-1}} p_\theta(y_i \mid p, \tilde{y}^{k-1}), \qquad S_{\text{conf}}(y \mid p) = \sum_{k=1}^L \log p_\theta(y_{i_k} \mid p, \tilde{y}^{k-1}), \tag{2}$$

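where $p$ is the prompt, $y$ is a candidate answer of length $L$, $\mathcal{M}_{k-1}$ is the set of answer positions still masked after $k-1$ reveal steps, and $\tilde{y}^{k-1}$ contains the ground‑truth tokens already revealed and masks elsewhere. This score is a task‑specific surrogate, not a likelihood estimate.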
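The greedy reveal order in Eq. (2) can be sketched as below. The function `confidence_score` follows the equation, while `toy_prob` is a made‑up stand‑in for $p_\theta$ (its numbers are arbitrary and it only handles three‑token answers). Both names are hypothetical and do not come from the paper's code.

```python
import math

MASK = None  # hypothetical placeholder for a masked answer position


def confidence_score(prompt, answer, prob_fn):
    """Eq. (2): repeatedly reveal the ground-truth token the model is most sure of."""
    state = [MASK] * len(answer)
    score, order = 0.0, []
    while MASK in state:
        masked = [i for i, tok in enumerate(state) if tok is MASK]
        probs = {i: prob_fn(prompt, state, i, answer[i]) for i in masked}
        i_k = max(masked, key=lambda i: probs[i])  # most confident masked position
        score += math.log(probs[i_k])
        state[i_k] = answer[i_k]  # reveal the ground-truth token, not a sampled one
        order.append(i_k)
    return score, order


def toy_prob(prompt, state, i, token):
    """Toy stand-in for p_theta: confidence grows as more answer tokens are revealed."""
    revealed = sum(tok is not MASK for tok in state)
    base = [0.2, 0.5, 0.3][i]
    return min(1.0, base + 0.2 * revealed)


score, order = confidence_score([1, 2], [7, 8, 9], toy_prob)
print(order, round(score, 4))
```

For a multiple‑choice question, one would compute this score for each candidate answer and pick the candidate with the highest value.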

Empirical Validation / Results

Base Model Results (Table 2)

iLLaDA 8B is compared with LLaDA 8B, Dream 7B (diffusion fine‑tuned from Qwen2.5), and Qwen2.5 7B (autoregressive). Key gains over LLaDA: +21.6 on BBH, +14.9 on ARC‑C, +11.6 on GSM8K, +14.6 on HumanEval. iLLaDA achieves the highest average among the models reported (63.9, versus 63.3 for Qwen2.5 7B).

Table 2 Benchmark results of base models.

Task	iLLaDA 8B	LLaDA 8B	Dream 7B	Qwen2.5 7B
MMLU	74.8	65.9	69.5	71.9
BBH	71.3	49.7	57.9	63.9
ARC‑C	60.8	45.9	59.8	51.5
HellaSwag	76.6	70.5	73.3	79.0
GSM8K	81.9	70.3	77.2	78.9
MATH	38.4	31.4	39.6	41.1
HumanEval	50.0	35.4	57.9	56.7
MBPP	57.8	40.0	56.2	63.6
Avg.	63.9	51.1	61.4	63.3

Instruct Model Results (Table 3)

iLLaDA‑Instruct improves over LLaDA‑Instruct by large margins (e.g., +11.5 on GSM8K, +14.5 on MATH, +16.5 on HumanEval) and outperforms Dream 7B on every task except MBPP (58.0 vs. 58.8). It trails Qwen2.5 7B Instruct on all benchmarks except MMLU‑Redux (76.4 vs. 75.7), with the largest gaps on math and code.

Table 3 Benchmark results of instruct models.

Task	iLLaDA 8B	LLaDA 8B	Dream 7B	Qwen2.5 7B
MMLU	71.6	65.5	67.0	76.6
MMLU‑Pro	52.3	37.0	43.3	56.3
MMLU‑Redux	76.4	68.9	76.3	75.7
GSM8K	89.0	77.5	81.0	91.6
MATH	56.7	42.2	39.2	75.5
HumanEval	65.9	49.4	55.5	84.8
MBPP	58.0	41.0	58.8	79.2
Avg.	67.1	54.5	60.2	77.1

Ablation Studies

Multiple‑choice scoring (Table 4). Confidence‑based scoring consistently outperforms likelihood‑style scoring across three tasks: gains of 1.3 on PIQA, 0.6 on ARC‑C, and 2.3 on HellaSwag.

Table 4 Ablation of multiple‑choice scoring rules.

Scoring rule	PIQA	ARC‑C	HellaSwag
Likelihood	77.2	60.2	74.3
Confidence	78.5	60.8	76.6

Number of SFT epochs (Figure 1). Performance on GSM8K, MATH, and MMLU‑Pro continues to improve up to 12 epochs, supporting the use of a long SFT schedule. This aligns with findings that diffusion models benefit from repeated data exposure.

Theoretical and Practical Implications

The results demonstrate that fully bidirectional diffusion training from scratch is a competitive path toward strong language models, challenging the prevailing assumption that autoregressive modeling is necessary for high‑performance LLMs. iLLaDA’s success shows that diffusion language models can scale to 12T tokens and achieve performance on par with strong autoregressive baselines like Qwen2.5 7B on several benchmarks.

Practical implications:

Architectural choices (GQA, tied embeddings, variable‑length training) are important for scaling diffusion LMs efficiently.
SFT with the same masked diffusion objective, applied over multiple epochs, is beneficial and does not lead to overfitting in the studied regime.
Confidence‑based scoring is a simple and effective method for multiple‑choice evaluation with diffusion models.

Limitations: iLLaDA has not been further aligned with reinforcement learning, which likely explains the remaining gap in the instruct setting. The study is limited to the 8B scale; larger scales may yield further gains.

Conclusion

iLLaDA, an 8B fully bidirectional masked diffusion language model trained from scratch on 12T tokens, substantially improves over prior diffusion language models (LLaDA, Dream) across general, mathematical, and code benchmarks. Its base model is competitive with Qwen2.5 7B, narrowing the gap between diffusion LMs and strong autoregressive models. The paper demonstrates that careful scaling of pre‑training and SFT, along with practical design choices like confidence‑based scoring and variable‑length generation, can yield strong performance from a non‑autoregressive architecture. Future work should explore reinforcement learning alignment (e.g., VRPO, diffu‑GRPO, MDPO, ESPO) and larger‑scale studies.