Rethinking Cross-Layer Information Routing in Diffusion Transformers
Summary (Overview)
- Problem Identification: The paper identifies three concrete symptoms of the standard residual addition inherited from Transformers in Diffusion Transformers (DiTs): monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy across depth and denoising timestep.
- Proposed Solution: It introduces Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs.
- Key Results: On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. When stacked with REPA, it yields a 2× training acceleration in the early stage.
- Orthogonal Contribution: The gains from DAR are orthogonal to existing representation-alignment objectives (e.g., REPA), positioning cross-layer information routing as a new, complementary design axis for diffusion models.
Introduction and Theoretical Foundation
Diffusion Transformers (DiTs) have become the dominant backbone for modern visual generation. While nearly every component (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited, the residual stream governing cross-layer information accumulation has been inherited unchanged from the original Transformer. This work argues that this default design is poorly suited to the time-varying dynamics of the denoising process.
The paper is motivated by two key insights:
- Diagnostic Insight: The standard pre-normalized residual stream in DiTs exhibits symptoms analogous to the "PreNorm dilution" phenomenon observed in LLMs, which intensify with depth: hidden-state magnitudes inflate, gradients decay, and adjacent blocks become redundant.
- Architectural Insight: The denoising timestep t, the core dimension distinguishing DiTs from standard Transformers, should play a vital role in how information is routed across layers. As denoising progresses from high to low noise, the most relevant intermediate features shift from coarse structure to fine details, necessitating adaptive, time-aware aggregation.
The goal is to elevate cross-layer information routing from an inherited convention to an explicit, optimized design axis for DiTs.
Methodology
The paper proposes Diffusion-Adaptive Routing (DAR), which replaces the standard fixed residual addition with a learned, timestep-aware aggregation mechanism.
Standard Residual Routing in DiTs
The standard update for sublayer l is:

$$h_{l+1} = h_l + f_l(h_l; t)$$

Unrolling the recurrence from h_1 = h_0 gives the accumulated information:

$$h_l = h_0 + \sum_{i=1}^{l-1} f_i(h_i; t)$$

This represents a fixed routing pattern where all previous outputs enter the stream with unit coefficients.
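The fixed, unit-coefficient routing described above can be checked numerically. The following is a minimal sketch, not the paper's code: `toy_sublayer` is a hypothetical stand-in for f_l (a fixed random linear map followed by tanh), and the timestep argument is omitted for brevity.

```python
import math
import random


def toy_sublayer(h, seed):
    """Hypothetical stand-in for f_l(h_l; t): a fixed random linear map followed by tanh."""
    rng = random.Random(seed)  # same seed -> same weights on every call
    d = len(h)
    return [
        math.tanh(sum(rng.uniform(-1.0, 1.0) * x for x in h) / math.sqrt(d))
        for _ in range(d)
    ]


def run_residual_stack(h0, num_sublayers):
    """Apply h_{l+1} = h_l + f_l(h_l) and keep every sublayer output v_i."""
    h = list(h0)
    outputs = [list(h0)]  # v_0 = h_0
    for l in range(1, num_sublayers + 1):
        v = toy_sublayer(h, seed=l)
        outputs.append(v)
        h = [a + b for a, b in zip(h, v)]  # unit-coefficient residual addition
    return h, outputs


h_final, outputs = run_residual_stack([0.5, -1.0, 2.0, 0.1], num_sublayers=6)
unrolled = [sum(column) for column in zip(*outputs)]  # h_0 plus every sublayer output
# The recurrent and unrolled forms agree up to floating-point error.
print(max(abs(a - b) for a, b in zip(h_final, unrolled)))
```

Whatever the sublayers compute, the stream is always the plain sum of their outputs; nothing in the update lets a later layer down-weight an earlier contribution.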
DAR Formulation
Let v_i = f_i(h_i; t) denote the output of the i-th sublayer, with v_0 = h_0. DAR replaces the unweighted sum with a softmax-weighted aggregation:

$$h_l = \sum_{i=0}^{l-1} a_{i \to l}(t)\, v_i$$

where the routing weights are computed via attention:

$$a_{i \to l}(t) = \frac{\exp\left(q_l(t)^\top k_i\right)}{\sum_{j=0}^{l-1} \exp\left(q_l(t)^\top k_j\right)}$$

Here, k_i = \text{RMSNorm}(v_i) is the key for source v_i. The aggregated h_l then enters the next sublayer transformation.
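As a rough illustration of the aggregation just described, the sketch below scores each stored output against a query, normalizes the scores with a softmax, and returns the weighted sum. It rests on assumptions not stated in the text: RMSNorm is applied without a learnable gain, the dot product is unscaled, and a single vector stands in for a full token sequence. The function names are illustrative.

```python
import math


def rms_norm(v, eps=1e-6):
    """RMSNorm without a learnable gain (an assumption of this sketch)."""
    mean_square = sum(x * x for x in v) / len(v)
    return [x / math.sqrt(mean_square + eps) for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dar_aggregate(query, sources):
    """Return (h_l, weights) for sources v_0..v_{l-1} and query q_l(t)."""
    keys = [rms_norm(v) for v in sources]  # k_i = RMSNorm(v_i)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)  # non-negative, sums to 1
    dim = len(sources[0])
    h = [sum(w * v[j] for w, v in zip(weights, sources)) for j in range(dim)]
    return h, weights


sources = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
h_uniform, w_uniform = dar_aggregate([0.0, 0.0], sources)  # zero query -> equal weights
h_focused, w_focused = dar_aggregate([5.0, 0.0], sources)  # favors sources aligned with axis 0
print(w_uniform, w_focused)
```

With a zero query the result is the mean of the sources rather than their sum, which is one way to see why the aggregation is non-incremental: the weights are renormalized at every layer instead of accumulating.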
Key Design Choices
- Query Parameterization: The per-layer query q_l(t) can be:
  - Static: q_l(t) = w_l (a learnable vector).
  - Dynamic: q_l(t) = W_q^{(l)} v_{l-1} (projection from the previous output).
  - Explicit Timestep Injection: q_l(t) = w_l + e(t) (reusing DiT's timestep embedding).
- Chunked Aggregation: To reduce the memory overhead of storing all L sources, sublayers are partitioned into N chunks of size S = L/N. The source set for layer l in chunk n is then built around chunk summaries, where c_n := v_{nS} is the summary of chunk n.
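The three query parameterizations listed above differ only in where timestep dependence enters. The sketch below makes that concrete under stated assumptions: `toy_timestep_embedding` is a hypothetical placeholder for DiT's timestep embedding e(t), and all vectors and matrices are plain Python lists rather than learned parameters.

```python
import math


def toy_timestep_embedding(t, dim):
    """Hypothetical stand-in for DiT's timestep embedding e(t)."""
    return [math.sin(t * (i + 1)) for i in range(dim)]


def static_query(w_l):
    """q_l(t) = w_l: a learnable vector, identical for every timestep."""
    return list(w_l)


def dynamic_query(W_q, v_prev):
    """q_l(t) = W_q^{(l)} v_{l-1}: depends on t only through the previous output."""
    return [sum(w * x for w, x in zip(row, v_prev)) for row in W_q]


def timestep_injected_query(w_l, t_embedding):
    """q_l(t) = w_l + e(t): the static vector shifted by the timestep embedding."""
    return [w + e for w, e in zip(w_l, t_embedding)]


w_l = [0.2, -0.1, 0.4]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

q_static = static_query(w_l)
q_dynamic = dynamic_query(identity, [0.5, 0.5, -1.0])
q_early = timestep_injected_query(w_l, toy_timestep_embedding(0.9, 3))
q_late = timestep_injected_query(w_l, toy_timestep_embedding(0.1, 3))
print(q_static, q_dynamic, q_early, q_late)
```

Only the static form yields the same query at every noise level; the other two change with t, either implicitly through v_{l-1} or explicitly through e(t).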
DAR preserves the isotropic, homogeneous Transformer stack and is compatible with modern enhancements like REPA.
Empirical Validation / Results
Main Results on ImageNet 256×256
The table below shows a system-level comparison. DAR variants achieve better FID than the SiT baseline with significantly fewer training iterations. Against U-Net-like skip-connection routing (U-DiT-L), the static variant is on par without guidance (7.56 vs. 7.54) and clearly better with guidance (2.08 vs. 3.00).
| Method | Iters. | Params | w/o guidance FID ↓ | w/ guidance FID ↓ |
|---|---|---|---|---|
| Standard Residuals | ||||
| SiT ode | 1.75M | 675M | 9.67 | 2.15 |
| SiT-Plus ode | 1M | 752M | 10.85 | 2.36 |
| U-Net-Like Routing | ||||
| U-DiT-L sde | 250K | 810M | 7.54 | 3.00 |
| Our Method (DAR) | ||||
| Static c4 ode | 600K | 675M | 7.56 | 2.08 |
| Dynamic c4 ode | 500K | 751M | 8.07 | 2.05 |
Table 1: System-level comparison on ImageNet 256×256. 'c4' denotes a chunk size of 4.
- Faster Convergence: The static DAR variant matches the baseline's converged quality (∼9.67 FID) in roughly 8.75× fewer iterations.
- Superior Final Quality: The static c4 variant reaches 7.56 FID without guidance (ODE), a 2.11 improvement over the SiT baseline (9.67), and DAR's best reported FID is 6.92 (SDE, no CFG).
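The headline figures can be cross-checked with simple arithmetic. This sketch assumes the 8.75× factor is measured against the 1.75M-iteration SiT run in Table 1, which the summary implies but does not state outright.

```python
baseline_iters = 1_750_000  # SiT (ODE) row of Table 1
speedup = 8.75              # reported convergence speedup

# Iterations at which DAR would match the baseline's converged FID.
iters_to_match = baseline_iters / speedup

baseline_fid = 9.67   # SiT, without guidance
static_c4_fid = 7.56  # DAR static c4 (ODE), without guidance
best_sde_fid = 6.92   # best reported DAR result (SDE, no CFG)

gain_static = round(baseline_fid - static_c4_fid, 2)
gain_best = round(baseline_fid - best_sde_fid, 2)

print(iters_to_match)  # 200000.0
print(gain_static)     # 2.11
print(gain_best)       # 2.75
```

The 2.11 gap corresponds to the 7.56 ODE result; measured against the 6.92 SDE result, the gap to the same baseline is 2.75.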
Ablation Studies and Analysis
- Timestep Awareness is Crucial: Ablations show that both timestep-aware query variants (dynamic and explicit injection) substantially outperform the timestep-blind (pure static) variant.
| Method | 100K | 200K | 400K |
|---|---|---|---|
| Static w/o t-injection | 22.36 | 15.47 | 11.51 |
| Dynamic | 13.95 | 9.29 | 8.10 |
| Static w/ t-injection | 17.39 | 10.12 | 7.97 |
Table 2: Ablation on timestep awareness in DAR (FID ↓ at different iterations).
- Orthogonality to REPA: DAR's gains compound with those from the representation-alignment objective REPA, confirming they operate on orthogonal axes.
| Method | 100K | 200K | 300K |
|---|---|---|---|
| SiT + REPA | 9.89 | 6.89 | 6.29 |
| DAR + REPA | 7.09 | 5.92 | 5.68 |
Table 3: Compatibility with REPA (FID ↓ at different iterations).
- Optimal Chunk Size: A sweep of chunk size S reveals a U-shaped performance pattern, with S=4 being optimal for SiT-XL/2 (L=56). This is predicted by a rate-distortion model in which the optimal S* scales with √L.
| Chunk size S | 1 | 4 | 8 |
|---|---|---|---|
| FID ↓ | 10.41 | 8.39 | 11.14 |
| IS ↑ | 107.2 | 121.7 | 103.51 |
Table 4: Effects of chunk size S on SiT-XL/2 (300K iterations).
- Application to Large-Scale T2I: When applied during Distribution Matching Distillation (DMD) of Qwen-Image, DAR helps preserve high-frequency details (sharp edges, fine textures) that are often attenuated during aggressive few-step distillation.
Theoretical and Practical Implications
- New Design Axis: The work establishes cross-layer information routing as a significant and previously underexplored architectural dimension for improving diffusion models, operating orthogonally to advances in conditioning, objectives, and backbone scaling.
- Diagnostic Framework: The three identified symptoms (magnitude inflation, gradient decay, redundancy) provide a concrete diagnostic framework for analyzing information flow in deep generative Transformers.
- Practical Benefits: DAR offers drop-in compatibility with existing DiT architectures and training methods (e.g., REPA), leading to substantial improvements in training efficiency (faster convergence, fewer iterations) and final output quality (lower FID, better detail preservation).
- Theoretical Underpinning: The optimal chunk size analysis (S* = √[L(1-α)/(1+α)]) provides a principled guideline for scaling DAR to deeper models, suggesting its benefits may widen with increasing model depth.
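The chunk-size formula above can be explored numerically. In the sketch below, α is treated as a free parameter because this summary does not define it, and `implied_alpha` simply inverts the formula: from S² = L(1-α)/(1+α) it follows that α = (L - S²)/(L + S²). Both function names are illustrative.

```python
import math


def optimal_chunk_size(num_sublayers, alpha):
    """S* = sqrt(L * (1 - alpha) / (1 + alpha)); alpha is left undefined in this summary."""
    if not -1.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (-1, 1] for a real, finite S*")
    return math.sqrt(num_sublayers * (1.0 - alpha) / (1.0 + alpha))


def implied_alpha(num_sublayers, chunk_size):
    """Invert the formula: alpha = (L - S^2) / (L + S^2)."""
    s_squared = chunk_size ** 2
    return (num_sublayers - s_squared) / (num_sublayers + s_squared)


# Value of alpha for which the empirically best S = 4 is optimal at L = 56.
alpha = implied_alpha(56, 4)  # 40 / 72 = 5 / 9
print(alpha)

# Holding alpha fixed, quadrupling the depth doubles the optimal chunk size.
print(optimal_chunk_size(56, alpha), optimal_chunk_size(224, alpha))
```

Under this reading, S = 4 at L = 56 corresponds to α = 5/9, and the √L scaling means the optimal chunk size grows only slowly with depth.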
Conclusion
This paper presents a systematic diagnosis of cross-layer information flow in Diffusion Transformers, identifying key limitations of the standard residual stream. In response, it proposes Diffusion-Adaptive Routing (DAR), a novel mechanism that enables learnable, timestep-conditioned aggregation across layers.
Empirical results demonstrate that DAR significantly accelerates training and improves final generation quality on ImageNet. Its orthogonality to representation-alignment objectives like REPA highlights cross-layer routing as a promising new direction for architectural innovation in diffusion modeling.
Future work involves scaling DAR to multi-billion parameter T2I and T2V backbones and exploring its benefits across a broader range of post-training objectives (fine-tuning, preference optimization, distillation).