Technical Report: Attention Residuals

Summary (Overview)

  • Core Contribution: Proposes Attention Residuals (AttnRes), a novel mechanism that replaces the fixed, uniform summation of standard residual connections with learned, input-dependent softmax attention over preceding layer outputs. This allows each layer to selectively retrieve and aggregate information from any earlier layer.
  • Scalable Variant: Introduces Block AttnRes, which groups layers into blocks and attends over block-level summaries, reducing memory and communication overhead from $O(Ld)$ to $O(Nd)$, making it practical for large-scale training.
  • Key Findings: AttnRes mitigates the PreNorm dilution problem, preventing unbounded hidden-state growth and leading to more uniform gradient distributions across depth. Scaling laws show consistent improvement, with Block AttnRes matching a baseline trained with 1.25x more compute.
  • Empirical Validation: Integrated into a 48B-parameter MoE model (Kimi Linear) and pre-trained on 1.4T tokens. AttnRes improved performance across all downstream benchmarks, with notable gains on multi-step reasoning (e.g., GPQA-Diamond +7.5) and code generation (e.g., HumanEval +3.1) tasks.

Introduction and Theoretical Foundation

Standard residual connections $h_l = h_{l-1} + f_{l-1}(h_{l-1})$ are foundational for training deep networks. While effective as gradient highways, they have a secondary, less-studied role: they define how information aggregates across depth. Unrolling the recurrence reveals that every layer receives a uniformly weighted sum of all prior outputs:

$$h_l = h_1 + \sum_{i=1}^{l-1} f_i(h_i)$$

This fixed, unit-weight accumulation lacks a mechanism for selective emphasis or suppression of individual layer contributions. With the prevalent PreNorm scheme, this leads to hidden-state magnitudes growing as $O(L)$ with depth, progressively diluting each layer's relative contribution [27]. Early-layer information is buried and cannot be selectively retrieved.
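The unrolled identity above is easy to check numerically. The following NumPy sketch (toy random layers of my own construction, not the paper's code) confirms that the residual recurrence equals the unit-weight sum of all earlier outputs, and that the hidden-state norm grows with depth when each layer emits a roughly unit-RMS output, as under PreNorm:

```python
# Sketch: residual recurrence vs. its unrolled form, and PreNorm-style
# norm growth. Toy stand-in layers, not the paper's architecture.
import numpy as np

rng = np.random.default_rng(0)
L, d = 32, 64
# Stand-in "layers": random linear maps rescaled to unit RMS output,
# mimicking the roughly unit-scale outputs produced under PreNorm.
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]
f = lambda i, h: (Ws[i] @ h) / np.sqrt(np.mean((Ws[i] @ h) ** 2))

h1 = rng.standard_normal(d)
h, outputs = h1, []
for i in range(L - 1):
    outputs.append(f(i, h))
    h = h + outputs[-1]          # standard residual update

unrolled = h1 + sum(outputs)     # h_L = h_1 + sum_i f_i(h_i)
assert np.allclose(h, unrolled)

# The residual stream's norm keeps growing, so each new layer's
# fixed unit-weight contribution is progressively diluted.
assert np.linalg.norm(h) > 2 * np.linalg.norm(h1)
```

The second assertion is the dilution effect in miniature: the stream's magnitude keeps climbing while every layer still contributes with weight one.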

The paper identifies a formal duality between depth-wise accumulation and the sequential recurrence in RNNs. Just as the Transformer replaced RNN recurrence with attention over sequences, the authors propose replacing the additive depth recurrence with attention over depth. This leads to the core formulation of Attention Residuals:

$$h_l = \alpha_{0 \to l} \cdot h_1 + \sum_{i=1}^{l-1} \alpha_{i \to l} \cdot f_i(h_i)$$

where $\alpha_{i \to l}$ are softmax attention weights computed in a content-dependent manner, enabling selective, learned retrieval across the network's depth.

Methodology

Full Attention Residuals

For each layer $l$, a learned pseudo-query vector $w_l \in \mathbb{R}^d$ is used to compute attention weights over all preceding layer outputs (and the initial embedding $h_1$). The keys and values are the layer outputs themselves:

  • Query: $q_l = w_l$
  • Key/Value: $k_i = v_i = \begin{cases} h_1 & i = 0 \\ f_i(h_i) & 1 \le i \le l-1 \end{cases}$

The attention weights use a softmax kernel with RMSNorm on the keys to prevent magnitude bias:

$$\alpha_{i \to l} = \frac{\phi(q_l, k_i)}{\sum_{j=0}^{l-1} \phi(q_l, k_j)}, \quad \text{where } \phi(q, k) = \exp\left(q^\top \text{RMSNorm}(k)\right)$$

The input to layer $l$ is then $h_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot v_i$.
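Putting the query, key/value, and softmax definitions together, a minimal NumPy sketch of the retrieval step looks as follows. This is my own reconstruction from the formulas above, not the authors' implementation, and `attn_res_input` is a hypothetical helper name:

```python
# Minimal sketch of the Full AttnRes retrieval step (reconstruction,
# not the paper's code): a pseudo-query attends over h_1 and all
# preceding layer outputs, with RMSNorm applied to keys.
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def attn_res_input(w_l, values):
    """values: (l, d) array holding [h_1, f_1(h_1), ..., f_{l-1}(h_{l-1})].
    Returns (h_l, alpha): the aggregated layer input and its depth weights."""
    logits = rms_norm(values) @ w_l        # q_l^T RMSNorm(k_i) for each key
    logits -= logits.max()                 # numerical stability
    alpha = np.exp(logits)
    alpha /= alpha.sum()                   # softmax over depth: alpha_{i->l}
    return alpha @ values, alpha           # h_l = sum_i alpha_{i->l} v_i

rng = np.random.default_rng(0)
d, l = 16, 5
values = rng.standard_normal((l, d))       # toy stand-ins for layer outputs
h_l, alpha = attn_res_input(rng.standard_normal(d), values)

assert np.isclose(alpha.sum(), 1.0) and (alpha > 0).all()
assert h_l.shape == (d,)
```

Note that, unlike token attention, the query here is a free learned parameter per layer rather than a projection of the input, which keeps the extra compute tiny.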

Overhead: In standard training, Full AttnRes adds negligible memory overhead, as layer outputs are already retained for backpropagation. However, under pipeline parallelism, it requires transmitting all $L$ layer outputs across stages, incurring $O(Ld)$ communication.

Block Attention Residuals

To address the scaling challenge, layers are partitioned into $N$ blocks of size $S = L/N$.

  1. Intra-Block Accumulation: Within a block $n$, layer outputs are summed into a single block representation:

    $$b_n = \sum_{j \in \mathcal{B}_n} f_j(h_j)$$

    where $\mathcal{B}_n$ is the set of layer indices in block $n$.

  2. Inter-Block Attention: Each layer attends over the set of completed block representations $\{b_0, b_1, \ldots, b_{n-1}\}$ (where $b_0 = h_1$) and, for layers after the first in a block, also over the current intra-block partial sum $b_n^{i-1}$.

This reduces the number of representations that must be stored and communicated from $L$ to $N$, lowering memory and communication overhead from $O(Ld)$ to $O(Nd)$.
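The two-step scheme can be sketched as a forward pass. This is my own reconstruction with toy `tanh` stand-in layers; the names (`blocks`, `partial`) and shapes are illustrative, not from the paper:

```python
# Hedged sketch of Block AttnRes bookkeeping: each layer attends over
# completed block summaries plus the current intra-block partial sum.
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, N, S = 8, 3, 2                      # N blocks of S layers each
f = lambda h: np.tanh(h)               # toy stand-in layer function
queries = rng.standard_normal((N * S, d)) * 0.1   # pseudo-queries w_l

h1 = rng.standard_normal(d)
blocks = [h1]                          # b_0 = h_1; grows to N summaries
for n in range(N):
    partial = np.zeros(d)              # intra-block accumulator b_n^{i}
    for i in range(S):
        layer = n * S + i
        # Keys/values: completed block summaries, plus the partial sum
        # once this block has produced at least one layer output.
        kv = np.stack(blocks + ([partial] if i > 0 else []))
        alpha = softmax(rms_norm(kv) @ queries[layer])
        h_l = alpha @ kv               # aggregated input to this layer
        partial = partial + f(h_l)     # accumulate f_j(h_j) into the block
    blocks.append(partial)             # summary b_{n+1} is now frozen

assert len(blocks) == N + 1            # h_1 plus one summary per block
```

At any point only the $N$ frozen summaries and one partial sum are live, which is the $O(Nd)$ footprint the text describes.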

Infrastructure Optimizations

The paper introduces several system-level optimizations to make Block AttnRes efficient at scale:

  • Cross-Stage Caching for Training: Under pipeline parallelism, previously received block representations are cached locally, eliminating redundant transmissions between virtual stages. This reduces peak per-transition communication cost from $O(C)$ to $O(P)$, where $C$ is the total number of chunks and $P$ the number of physical stages.
  • Two-Phase Computation for Inference:
    • Phase 1 (Parallel): Batches all pseudo-queries $w_l$ within a block for a single, amortized attention computation over the cached block representations.
    • Phase 2 (Sequential): Computes attention over the evolving intra-block partial sum for each layer and merges the results with Phase 1 outputs using online softmax.
  • Memory-Efficient Prefilling: Block representations are sharded along the sequence dimension across tensor-parallel devices to reduce memory footprint during long-context prefilling.
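The Phase 1/Phase 2 merge relies on the standard online-softmax (log-sum-exp) identity; the sketch below shows that merging two separately computed partial attentions reproduces a single softmax over the union of keys. This is the generic identity under assumed shapes, not the authors' kernel:

```python
# Online-softmax merge: combine attention computed over cached block
# summaries (Phase 1) with attention over the intra-block partial sum
# (Phase 2). Generic identity, not the paper's kernel.
import numpy as np

def partial_softmax_attn(q, keys, values):
    """Return (output, max_logit, sum_exp) so results can be merged later."""
    logits = keys @ q
    m = logits.max()
    e = np.exp(logits - m)
    return (e @ values) / e.sum(), m, e.sum()

def merge(o1, m1, s1, o2, m2, s2):
    """Online-softmax merge of two partial attention results."""
    m = max(m1, m2)
    s = s1 * np.exp(m1 - m) + s2 * np.exp(m2 - m)
    return (o1 * s1 * np.exp(m1 - m) + o2 * s2 * np.exp(m2 - m)) / s

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal(d)
cached = rng.standard_normal((4, d))    # Phase 1: frozen block summaries
partial = rng.standard_normal((1, d))   # Phase 2: evolving partial sum

o1, m1, s1 = partial_softmax_attn(q, cached, cached)
o2, m2, s2 = partial_softmax_attn(q, partial, partial)
merged = merge(o1, m1, s1, o2, m2, s2)

# Reference: one softmax attention over the concatenated key/value set.
ref, _, _ = partial_softmax_attn(q, np.vstack([cached, partial]),
                                 np.vstack([cached, partial]))
assert np.allclose(merged, ref)
```

Because the merge is exact, the batched Phase 1 work can be shared across all layers of a block while only the small Phase 2 term is recomputed per layer.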

Pseudo-Query Initialization: A critical detail is that all pseudo-query vectors $w_l$ must be initialized to zero. This ensures initial attention weights are uniform, making AttnRes equivalent to an equal-weight average at training start, which prevents training instability.
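This claim follows directly from the softmax form: a zero query makes every logit $q^\top \text{RMSNorm}(k)$ zero, so the weights are exactly uniform. A quick numerical check with toy data:

```python
# Zero-initialized pseudo-query => uniform depth attention weights =>
# AttnRes starts as an equal-weight average. Toy data, assumed shapes.
import numpy as np

rng = np.random.default_rng(0)
d, l = 16, 6
values = rng.standard_normal((l, d))           # h_1 and layer outputs
w_l = np.zeros(d)                              # zero-initialized pseudo-query

keys = values / np.sqrt(np.mean(values ** 2, axis=-1, keepdims=True))
logits = keys @ w_l                            # all zeros
alpha = np.exp(logits) / np.exp(logits).sum()  # softmax over depth

assert np.allclose(alpha, np.full(l, 1 / l))   # exactly uniform weights
assert np.allclose(alpha @ values, values.mean(axis=0))
```

Note this uniform average still differs from a standard residual stream (which uses unit weights, not weights summing to one), which is presumably why a careful initialization matters for stability.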

Empirical Validation / Results

Scaling Law Experiments

Models of five different sizes were trained (see Table 2). Both Full and Block AttnRes ($N \approx 8$) consistently outperformed the PreNorm baseline across all compute budgets.

Table 2: Model configurations and validation loss for scaling law experiments.

| # Act. Params † | Tokens | $L_b$ | $H$ | $d_{model}$ | $d_{ff}$ | lr | batch size ‡ | Baseline | Block AttnRes | Full AttnRes | mHC(-lite) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 194M | 38.7B | 12 | 12 | 896 | 400 | $2.99 \times 10^{-3}$ | 192 | 1.931 | 1.909 | 1.899 | 1.906 |
| 241M | 45.4B | 13 | 13 | 960 | 432 | $2.80 \times 10^{-3}$ | 256 | 1.895 | 1.875 | 1.874 | 1.869 |
| 296M | 62.1B | 14 | 14 | 1024 | 464 | $2.50 \times 10^{-3}$ | 320 | 1.829 | 1.809 | 1.804 | 1.807 |
| 436M | 87.9B | 16 | 16 | 1168 | 528 | $2.20 \times 10^{-3}$ | 384 | 1.766 | 1.746 | 1.737 | 1.747 |
| 528M | 119.0B | 17 | 17 | 1264 | 560 | $2.02 \times 10^{-3}$ | 432 | 1.719 | 1.693 | 1.692 | 1.694 |

The last four columns report validation loss. † Activated parameters in MoE models. ‡ Context length = 8192. $L_b = L/2$ is the number of Transformer blocks.

  • Fitted Scaling Curves:
    • Baseline: $L = 1.891 \times C^{-0.057}$
    • Block AttnRes: $L = 1.870 \times C^{-0.058}$
    • Full AttnRes: $L = 1.865 \times C^{-0.057}$
  • Compute Equivalence: At 5.6 PFLOP/s-days, Block AttnRes achieves a loss of 1.692 vs. the baseline's 1.714, equivalent to a 1.25x compute advantage.
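The 1.25x figure follows directly from the fitted curves above; a quick arithmetic check (pure algebra on the reported fits, no new data):

```python
# Verify the compute-equivalence claim from the fitted scaling curves.
import math

C = 5.6                                    # PFLOP/s-days
base = lambda c: 1.891 * c ** -0.057       # baseline fit
block = lambda c: 1.870 * c ** -0.058      # Block AttnRes fit

assert abs(block(C) - 1.692) < 1e-3        # Block AttnRes loss at 5.6
assert abs(base(C) - 1.714) < 1e-3         # baseline loss at 5.6

# Compute the baseline would need to match Block AttnRes' loss at C:
# solve 1.891 * C_eq^-0.057 = block(C) for C_eq.
C_eq = math.exp(math.log(1.891 / block(C)) / 0.057)
assert abs(C_eq / C - 1.25) < 0.01         # ~1.25x compute advantage
```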

Main Results on 48B Model

A 48B-total-parameter (3B activated) Kimi Linear MoE model was pre-trained on 1.4T tokens with Block AttnRes ($N = 9$ blocks).

Training Dynamics Analysis (Fig. 5):

  • Validation Loss: AttnRes achieved consistently lower loss.
  • Output Magnitude: The baseline showed unbounded growth with depth (PreNorm dilution), while AttnRes confined growth within blocks, producing a bounded periodic pattern.
  • Gradient Magnitude: The baseline had disproportionately large gradients in early layers. AttnRes led to a more uniform gradient distribution across depth.

Table 3: Downstream performance comparison after pre-training.

| Benchmark | Baseline | AttnRes |
| :--- | :--- | :--- |
| General | | |
| MMLU | 73.5 | 74.6 |
| MMLU-Pro | 52.2 | 52.2 |
| GPQA-Diamond | 36.9 | 44.4 |
| BBH | 76.3 | 78.0 |
| ARC-Challenge | 64.6 | 65.7 |
| HellaSwag | 83.2 | 83.4 |
| TriviaQA | 69.9 | 71.8 |
| Math & Code | | |
| GSM8K | 81.7 | 82.4 |
| MGSM | 64.9 | 66.1 |
| Math | 53.5 | 57.1 |
| CMath | 84.7 | 85.1 |
| HumanEval | 59.1 | 62.2 |
| MBPP | 72.0 | 73.9 |
| Chinese | | |
| CMMLU | 82.0 | 82.9 |
| C-Eval | 79.6 | 82.5 |

AttnRes matched or outperformed the baseline on all 15 benchmarks, with particularly strong gains on multi-step reasoning and code generation tasks.

Ablation Studies

Key findings from component ablations on a 16-layer model (Table 4):

  • Importance of Input-Dependence: DenseFormer (static weights) showed no gain (1.767 vs. baseline 1.766), while AttnRes (dynamic) achieved 1.737.
  • Block Size Trade-off: Performance degrades gracefully as block size $S$ increases (Fig. 6). $S = 4$ (1.746) recovers most of the gain of Full AttnRes (1.737).
  • Mechanism Design:
    • Softmax vs. Sigmoid: Softmax (competitive normalization) performed better (1.737 vs. 1.741).
    • RMSNorm on Keys: Crucial for performance; prevents magnitude bias (1.737 vs. 1.743 w/o RMSNorm).
    • Multi-Head Attention: Per-head depth aggregation hurt performance (1.752 vs. 1.746), suggesting optimal depth-wise mixture is largely uniform across channels.

Analysis of Learned Patterns

Visualization of the depth-wise attention weights $\alpha_{i \to l}$ (Fig. 8) revealed:

  • Preserved Locality: Each layer attends most strongly to its immediate predecessor (diagonal dominance).
  • Learned Skip Connections: Selective off-diagonal concentrations emerge, e.g., layer 4 attending to early sources.
  • Layer Specialization: The embedding $h_1$ retains non-trivial weight throughout, especially before attention layers. Pre-MLP layers show sharper reliance on recent representations.
  • Block Structure Preservation: Block AttnRes maintains the essential patterns (diagonal dominance, embedding persistence) of Full AttnRes.

Theoretical and Practical Implications

Theoretical Insight: Residual Connections as Structured Matrices

The paper provides a unified view by framing residual variants via a depth mixing matrix $M \in \mathbb{R}^{L \times L}$, where $M_{i \to l}$ is the weight layer $l$ assigns to the output of layer $i$. The input to each layer is $h_l = \sum_{i=0}^{l-1} M_{i \to l} v_i$.

  • Standard Residuals: $M$ is an all-ones lower-triangular matrix (rank-1 semiseparable).
  • Highway Networks: $M$ is 1-semiseparable with input-dependent weights.
  • (m)HC: $M_{i \to l} = \beta_i^\top A_{i+1 \to l}^\times \alpha_l$, making $M$ $m$-semiseparable. This corresponds to depth-wise linear attention.
  • Full AttnRes: $M$ is a dense matrix of softmax attention scores, corresponding to depth-wise softmax attention.
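For the standard-residual case, the matrix view can be verified directly: applying the all-ones lower-triangular $M$ to the stacked values reproduces the residual stream. A small sketch with toy `tanh` layers (my own construction; indexing conventions may differ slightly from the paper's):

```python
# Structured-matrix view of standard residuals: M = lower-triangular
# ones, so M @ V recovers every hidden state h_1 ... h_L at once.
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 4
f = lambda h: np.tanh(h)           # toy stand-in layer function

h = rng.standard_normal(d)
values = [h.copy()]                # v_0 = h_1
hiddens = [h.copy()]               # h_1
for _ in range(L - 1):
    out = f(h)
    values.append(out)             # v_i = f_i(h_i)
    h = h + out                    # standard residual recurrence
    hiddens.append(h.copy())       # h_2 ... h_L

V = np.stack(values)               # (L, d) stacked values
M = np.tril(np.ones((L, L)))       # all-ones lower-triangular mixing matrix
# Row r of M @ V is the unit-weight cumulative sum v_0 + ... + v_r,
# which is exactly the hidden state h_{r+1}.
assert np.allclose(M @ V, np.stack(hiddens))
```

Replacing this fixed $M$ with input-dependent softmax rows is precisely the AttnRes generalization described above.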

This perspective shows AttnRes completes the transition from linear to softmax attention over depth, mirroring the transformative shift that occurred over sequences.

Practical Implications

  • Mitigates PreNorm Dilution: By allowing selective aggregation, AttnRes bounds hidden-state growth and leads to more uniform gradient flow.
  • Enables Selective Information Retrieval: Later layers can directly access and emphasize useful representations from any earlier layer, which is particularly beneficial for multi-step reasoning.
  • Scalable and Efficient: Block AttnRes, with its system optimizations, serves as a practical drop-in replacement for standard residuals, with:
    • < 4% training overhead under pipeline parallelism.
    • < 2% inference latency overhead on typical workloads.
  • Architectural Preference: Architecture sweeps suggest AttnRes allows models to benefit more effectively from increased depth compared to standard residuals.

Conclusion

Attention Residuals (AttnRes) rethinks the fundamental residual connection by introducing learned, input-dependent attention over depth. It addresses key limitations of standard residuals, namely fixed aggregation and the PreNorm dilution problem. The Block AttnRes variant makes the approach scalable for large-model training with minimal overhead. Empirical results demonstrate consistent improvements across model scales and downstream tasks. The work draws a formal duality between sequence and depth, positioning AttnRes as the depth-wise analog of the softmax attention that revolutionized sequence modeling.

Future Directions: As hardware constraints relax, exploring finer-grained block sizes or Full AttnRes is a natural path. Incorporating more memory-efficient (e.g., linear-complexity) attention alternatives over depth is also a promising research direction.