Rethinking Cross-Layer Information Routing in Diffusion Transformers

Summary (Overview)

Problem Identification: The paper identifies three concrete symptoms of the standard residual addition that Diffusion Transformers (DiTs) inherit from the original Transformer: monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy across depth and denoising timestep.
Proposed Solution: It introduces Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs.
Key Results: On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. When stacked with REPA, it yields a 2× training acceleration in the early stage.
Orthogonal Contribution: The gains from DAR are orthogonal to existing representation-alignment objectives (e.g., REPA), positioning cross-layer information routing as a new, complementary design axis for diffusion models.

Introduction and Theoretical Foundation

Diffusion Transformers (DiTs) have become the dominant backbone for modern visual generation. While nearly every component—tokenization, attention, conditioning, objectives, and latent autoencoders—has been extensively revisited, the residual stream governing cross-layer information accumulation has been directly inherited from the original Transformer. This work argues that this default design is poorly suited for the time-varying dynamics of the denoising process.

The paper is motivated by two key insights:

Diagnostic Insight: The standard pre-normalized residual stream in DiTs exhibits symptoms analogous to the "PreNorm dilution" phenomenon observed in LLMs, which intensify with depth: hidden-state magnitudes inflate, gradients decay, and adjacent blocks become redundant.
Architectural Insight: The denoising timestep (t)—the core dimension distinguishing DiTs from standard Transformers—should play a vital role in how information is routed across layers. As denoising progresses from high to low noise, the most relevant intermediate features shift from coarse structure to fine details, necessitating adaptive, time-aware aggregation.

The goal is to elevate cross-layer information routing from an inherited convention to an explicit, optimized design axis for DiTs.

Methodology

The paper proposes Diffusion-Adaptive Routing (DAR), which replaces the standard fixed residual addition with a learned, timestep-aware aggregation mechanism.

Standard Residual Routing in DiTs

The standard update for sublayer l is:

h_{l+1} = h_l + f_l(h_l; t)

Unrolling the recurrence gives the accumulated information:

h_l = h_0 + \sum_{i=0}^{l-1} f_i(h_i; t)

This represents a fixed routing pattern where all previous outputs enter the stream with unit coefficients.
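The equivalence between the recurrence and its unrolled sum can be checked numerically. The sketch below is illustrative only: `make_sublayer` is a hypothetical stand-in for f_l (a small random linear map scaled by t), not the attention or MLP sublayers of an actual DiT.

```python
import random

def make_sublayer(seed, dim):
    """Hypothetical stand-in for f_l(h; t): a fixed random linear map scaled by t."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def f(h, t):
        return [t * sum(w[r][c] * h[c] for c in range(dim)) for r in range(dim)]

    return f

def run_residual_stack(h0, t, sublayers):
    """Apply h_{l+1} = h_l + f_l(h_l; t) and record every sublayer output."""
    h, outputs = list(h0), []
    for f in sublayers:
        v = f(h, t)
        outputs.append(v)
        h = [a + b for a, b in zip(h, v)]
    return h, outputs

dim, depth, t = 4, 6, 0.5
h0 = [1.0, -2.0, 0.5, 3.0]
sublayers = [make_sublayer(seed, dim) for seed in range(depth)]
h_final, outputs = run_residual_stack(h0, t, sublayers)

# Unrolled form: h_0 plus every sublayer output, each with coefficient 1.
h_unrolled = [h0[d] + sum(v[d] for v in outputs) for d in range(dim)]
max_gap = max(abs(a - b) for a, b in zip(h_final, h_unrolled))
```

Here `max_gap` is zero up to floating-point rounding, which makes the "unit coefficients" reading concrete: no output is ever reweighted or dropped once it has entered the stream.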

DAR Formulation

Let v_i = f_i(h_i; t) denote the output of the i-th sublayer for i ≥ 1, with v_0 = h_0 as the initial source. DAR replaces the unweighted sum with a softmax-weighted aggregation:

h_l = \sum_{i=0}^{l-1} \alpha_{i \to l}(t) v_i

where the routing weights are computed via attention:

\alpha_{i \to l}(t) = \frac{\exp\left(q_l(t)^\top k_i / \sqrt{d}\right)}{\sum_{j=0}^{l-1} \exp\left(q_l(t)^\top k_j / \sqrt{d}\right)}

Here, k_i = \text{RMSNorm}(v_i) is the key for source v_i. The aggregated h_l then enters the next sublayer transformation.
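A minimal sketch of this aggregation for a single token vector, in plain Python, may help fix the notation. It assumes an RMSNorm without a learned gain and takes the query as a given vector; the function names `rms_norm` and `dar_aggregate` are illustrative and do not come from the paper.

```python
import math

def rms_norm(v, eps=1e-6):
    """RMSNorm without a learned gain (the gain is omitted as a simplification)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def dar_aggregate(query, sources):
    """Return (h_l, alpha) using softmax(q^T k_i / sqrt(d)) with k_i = RMSNorm(v_i)."""
    d = len(query)
    logits = [
        sum(q * k for q, k in zip(query, rms_norm(v))) / math.sqrt(d)
        for v in sources
    ]
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    alpha = [e / total for e in exps]
    h = [sum(a * v[j] for a, v in zip(alpha, sources)) for j in range(d)]
    return h, alpha

# Three toy sources v_0, v_1, v_2 in d = 4 dimensions.
sources = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, -1.0, 1.0]]

# A zero query yields uniform weights; a query aligned with v_1 concentrates on it.
h_uniform, alpha_uniform = dar_aggregate([0.0, 0.0, 0.0, 0.0], sources)
h_peaked, alpha_peaked = dar_aggregate([0.0, 8.0, 0.0, 0.0], sources)
```

Because the weights are non-negative and sum to one, h_l is a convex combination of the sources, so its norm cannot exceed that of the largest source. This contrasts with the unit-coefficient sum of the standard residual stream.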

Key Design Choices

Query Parameterization: The per-layer query q_l(t) can be:
- Static: q_l(t) = w_l (a learnable vector).
- Dynamic: q_l(t) = W_q^{(l)} v_{l-1} (a projection of the previous sublayer output, which carries the timestep dependence).
- Explicit Timestep Injection: q_l(t) = w_l + e(t) (reusing DiT's timestep embedding).
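The three query options can be written side by side as small functions. This is a sketch under stated assumptions: `toy_timestep_embedding` is a hypothetical placeholder for e(t), since the timestep embedding a real DiT would reuse is not specified here, and all function names are illustrative.

```python
import math

def toy_timestep_embedding(t, dim):
    """Hypothetical stand-in for e(t); a real DiT would reuse its own embedding."""
    return [math.sin((k + 1) * t) for k in range(dim)]

def static_query(w_l):
    """Static: q_l(t) = w_l, identical at every timestep."""
    return list(w_l)

def dynamic_query(W_q, v_prev):
    """Dynamic: q_l(t) = W_q^{(l)} v_{l-1}, a matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v_prev)) for row in W_q]

def injected_query(w_l, t):
    """Explicit timestep injection: q_l(t) = w_l + e(t)."""
    e_t = toy_timestep_embedding(t, len(w_l))
    return [w + e for w, e in zip(w_l, e_t)]

w_l = [0.5, -0.5, 1.0]
W_q = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]

q_static = static_query(w_l)
q_dynamic = dynamic_query(W_q, v_prev=[1.0, 1.0, 1.0])
q_t_high = injected_query(w_l, t=0.9)
q_t_low = injected_query(w_l, t=0.1)
```

Only the static form returns the same query at every timestep. The dynamic form varies with t indirectly, through v_{l-1}, while the injected form varies with t directly.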
Chunked Aggregation: To reduce memory overhead from storing all L sources, sublayers are partitioned into N chunks of size S = L/N. The source set for layer l in chunk n becomes:

\mathcal{S}_l = \underbrace{\{c_0, c_1, \dots, c_{n-1}\}}_{\text{prior chunk summaries}} \cup \underbrace{\{v_{(n-1)S+1}, \dots, v_{l-1}\}}_{\text{current intra-chunk sources}}

where c_n := v_{nS} is the summary of chunk n.
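The source set can be enumerated directly from its definition. The sketch below assumes sublayers are indexed 1..L and that chunk n covers sublayers (n-1)S+1 through nS, which is how the formula reads; `chunk_sources` is an illustrative helper, not a function from the paper.

```python
def chunk_sources(l, S):
    """Indices i of the sources v_i visible to sublayer l (1-indexed), chunk size S."""
    n = (l - 1) // S + 1                     # chunk that contains sublayer l
    summaries = [k * S for k in range(n)]    # c_k := v_{kS} for k = 0..n-1
    intra = list(range((n - 1) * S + 1, l))  # v_{(n-1)S+1}, ..., v_{l-1}
    return summaries + intra

L, S = 56, 4  # SiT-XL/2 depth in sublayers and the chunk size used in the summary
per_layer = {l: chunk_sources(l, S) for l in range(1, L + 1)}

# Largest number of sources any sublayer must attend over, with and without chunking.
max_chunked = max(len(s) for s in per_layer.values())
max_full = L  # without chunking, sublayer L sees v_0, ..., v_{L-1}
```

Under these assumptions the first sublayer of chunk 2 (l = 5) sees only the summaries v_0 and v_4, and the largest source set has 17 entries rather than 56.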

DAR preserves the isotropic, homogeneous Transformer stack and is compatible with modern enhancements like REPA.

Empirical Validation / Results

Main Results on ImageNet 256×256

The table below shows a system-level comparison. DAR variants reach a lower FID than the SiT baseline with substantially fewer training iterations. Against U-Net-like routing (U-DiT-L), the static DAR variant is on par without guidance (7.56 vs. 7.54) and clearly better with guidance (2.08 vs. 3.00), with fewer parameters.

Method	Sampler	Iters.	Params	FID ↓ (w/o guidance)	FID ↓ (w/ guidance)
Standard Residuals
SiT	ODE	1.75M	675M	9.67	2.15
SiT-Plus	ODE	1M	752M	10.85	2.36
U-Net-Like Routing
U-DiT-L	SDE	250K	810M	7.54	3.00
Our Method (DAR)
Static c4	ODE	600K	675M	7.56	2.08
Dynamic c4	ODE	500K	751M	8.07	2.05

Table 1: System-level comparison on ImageNet 256×256. 'c4' denotes a chunk size of 4.

Faster Convergence: The static DAR variant matches the baseline's converged quality (∼9.67 FID) in roughly 8.75× fewer iterations.
Superior Final Quality: With the ODE sampler and no guidance, DAR reaches 7.56 FID against 9.67 for the SiT baseline, a 2.11 improvement. The best FID reported for DAR is 6.92 (SDE, no CFG).
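The headline figures can be cross-checked against Table 1 with a few lines of arithmetic. The sketch assumes that the 8.75× factor is measured against the baseline's 1.75M iterations, which the summary does not state explicitly.

```python
# Values copied from Table 1 (ODE sampler, without guidance).
baseline_fid = 9.67         # SiT
dar_static_fid = 7.56       # DAR, Static c4
baseline_iters = 1_750_000  # SiT training iterations
speedup = 8.75              # reported convergence speedup

fid_gain = round(baseline_fid - dar_static_fid, 2)

# Assumption: the speedup is relative to the 1.75M-iteration baseline.
iters_to_match = baseline_iters / speedup
```

Under that assumption, the 2.11 FID gain is the gap between the two ODE rows, and the baseline's quality would be matched at about 200K iterations.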

Ablation Studies and Analysis

Timestep Awareness is Crucial: Ablations show that both timestep-aware query variants (dynamic and explicit injection) substantially outperform the timestep-blind (pure static) variant.

Method	100K	200K	400K
Static w/o t-injection	22.36	15.47	11.51
Dynamic	13.95	9.29	8.10
Static w/ t-injection	17.39	10.12	7.97

Table 2: Ablation on timestep awareness in DAR (FID ↓ at different iterations).

Orthogonality to REPA: DAR's gains compound with those from the representation-alignment objective REPA, confirming they operate on orthogonal axes.

Method	100K	200K	300K
SiT + REPA	9.89	6.89	6.29
DAR + REPA	7.09	5.92	5.68

Table 3: Compatibility with REPA (FID ↓ at different iterations).

Optimal Chunk Size: A sweep of chunk size S reveals a U-shaped performance pattern, with S=4 being optimal for SiT-XL/2 (L=56). This is predicted by a rate-distortion model where the optimal S* scales with √L.

Chunk size `S`	1	4	8
FID ↓	10.41	8.39	11.14
IS ↑	107.2	121.7	103.51

Table 4: Effects of chunk size S on SiT-XL/2 (300K iterations).

Application to Large-Scale T2I: When applied during Distribution Matching Distillation (DMD) of Qwen-Image, DAR helps preserve high-frequency details (sharp edges, fine textures) that are often attenuated during aggressive few-step distillation.

Theoretical and Practical Implications

New Design Axis: The work establishes cross-layer information routing as a significant and previously underexplored architectural dimension for improving diffusion models, operating orthogonally to advances in conditioning, objectives, and backbone scaling.
Diagnostic Framework: The three identified symptoms (magnitude inflation, gradient decay, redundancy) provide a concrete diagnostic framework for analyzing information flow in deep generative Transformers.
Practical Benefits: DAR offers drop-in compatibility with existing DiT architectures and training methods (e.g., REPA), leading to substantial improvements in training efficiency (faster convergence, fewer iterations) and final output quality (lower FID, better detail preservation).
Theoretical Underpinning: The optimal chunk size analysis, S^* = \sqrt{L (1-\alpha)/(1+\alpha)}, provides a principled guideline for scaling DAR to deeper models, suggesting its benefits may widen with increasing model depth.
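As a rough consistency check, the chunk-size formula can be inverted at the empirical optimum. The worked step below assumes the formula holds with equality at S^* = 4 for L = 56. Here α is the parameter of the rate-distortion model, which this summary does not define, and it is distinct from the routing weights α_{i→l}.

```latex
S^{*} = \sqrt{L \cdot \frac{1-\alpha}{1+\alpha}}
\;\Longrightarrow\;
\frac{1-\alpha}{1+\alpha} = \frac{(S^{*})^{2}}{L} = \frac{4^{2}}{56} = \frac{2}{7}
\;\Longrightarrow\;
\alpha = \frac{1 - 2/7}{1 + 2/7} = \frac{5}{9} \approx 0.556

% Holding \alpha = 5/9 fixed and doubling the depth to L = 112:
S^{*} = \sqrt{112 \cdot \tfrac{2}{7}} = \sqrt{32} \approx 5.66
```

If α stays fixed, doubling the depth multiplies the predicted optimum by √2, which is the √L scaling in concrete terms.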

Conclusion

This paper presents a systematic diagnosis of cross-layer information flow in Diffusion Transformers, identifying key limitations of the standard residual stream. In response, it proposes Diffusion-Adaptive Routing (DAR), a novel mechanism that enables learnable, timestep-conditioned aggregation across layers.

Empirical results demonstrate that DAR significantly accelerates training and improves final generation quality on ImageNet. Its orthogonality to representation-alignment objectives like REPA highlights cross-layer routing as a promising new direction for architectural innovation in diffusion modeling.

Future work involves scaling DAR to multi-billion parameter T2I and T2V backbones and exploring its benefits across a broader range of post-training objectives (fine-tuning, preference optimization, distillation).