# Attention Residuals

> Attention Residuals replace fixed residual connections with learned attention over preceding layers, enabling selective information retrieval and improving performance on reasoning and code tasks.

- **Source:** [arXiv](https://arxiv.org/abs/2603.15031)
- **Published:** 2026-03-18
- **Permalink:** https://picx.dev/p/2znetu
- **Whiteboard:** https://picx.dev/p/2znetu/image

## Summary

*   **Core Contribution**: Proposes **Attention Residuals (AttnRes)**, a novel mechanism that replaces the fixed, uniform summation of standard residual connections with learned, input-dependent **softmax attention over preceding layer outputs**. This allows each layer to selectively retrieve and aggregate information from any earlier layer.
*   **Scalable Variant**: Introduces **Block AttnRes**, which groups layers into blocks and attends over block-level summaries, reducing memory and communication overhead from $O(Ld)$ to $O(Nd)$, making it practical for large-scale training.
*   **Key Findings**: AttnRes mitigates the **PreNorm dilution problem**, preventing unbounded hidden-state growth and leading to more uniform gradient distributions across depth. Scaling laws show consistent improvement, with Block AttnRes matching a baseline trained with **1.25x more compute**.
*   **Empirical Validation**: Integrated into a 48B-parameter MoE model (Kimi Linear) and pre-trained on 1.4T tokens. AttnRes matched or improved on the baseline across all 15 downstream benchmarks, with notable gains on multi-step reasoning (e.g., GPQA-Diamond +7.5) and code generation (e.g., HumanEval +3.1) tasks.

## Introduction and Theoretical Foundation
Standard residual connections $h_l = h_{l-1} + f_{l-1}(h_{l-1})$ are foundational for training deep networks. While effective as gradient highways, they have a secondary, less-studied role: they define how information **aggregates across depth**. Unrolling the recurrence reveals that every layer receives a uniformly weighted sum of all prior outputs:

$$
h_l = h_1 + \sum_{i=1}^{l-1} f_i(h_i)
$$

This **fixed, unit-weight accumulation** lacks a mechanism for selective emphasis or suppression of individual layer contributions. With the prevalent **PreNorm** scheme, this leads to hidden-state magnitudes growing as $O(L)$ with depth, progressively **diluting each layer's relative contribution** [27]. Early-layer information is buried and cannot be selectively retrieved.
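The equivalence between the recurrence and its unrolled form can be checked numerically. The sketch below uses three hypothetical scalar "layers" (the lambdas are illustrative stand-ins, not anything from the paper) and confirms that iterating $h_l = h_{l-1} + f_{l-1}(h_{l-1})$ gives the same value as $h_1$ plus the unit-weight sum of all layer outputs.

```python
def run_residual(h1, layers):
    """Iterate h <- h + f(h) and record every layer output f_i(h_i)."""
    h = h1
    outputs = []
    for f in layers:
        out = f(h)
        outputs.append(out)
        h = h + out
    return h, outputs


# Hypothetical scalar layers, chosen only so the arithmetic is easy to follow.
layers = [lambda h: 0.5 * h, lambda h: h - 1.0, lambda h: 2.0]
h1 = 1.0

h_final, outputs = run_residual(h1, layers)
unrolled = h1 + sum(outputs)  # h_1 + sum_i f_i(h_i), every term weighted by 1

print(h_final, unrolled)
```

Every output enters the sum with weight 1, so nothing in this scheme can emphasise or mute an individual layer.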

The paper identifies a **formal duality between depth-wise accumulation and the sequential recurrence in RNNs**. Just as the Transformer replaced RNN recurrence with attention over sequences, the authors propose replacing the additive depth recurrence with **attention over depth**. This leads to the core formulation of Attention Residuals:

$$
h_l = \alpha_{0 \to l} \cdot h_1 + \sum_{i=1}^{l-1} \alpha_{i \to l} \cdot f_i(h_i)
$$

where $\alpha_{i \to l}$ are **softmax attention weights** computed in a content-dependent manner, enabling selective, learned retrieval across the network's depth.

## Methodology
### Full Attention Residuals
For each layer $l$, a **learned pseudo-query vector** $w_l \in \mathbb{R}^d$ is used to compute attention weights over all preceding layer outputs (and the initial embedding $h_1$). The keys and values are the layer outputs themselves:

*   Query: $q_l = w_l$
*   Key/Value: $k_i = v_i = \begin{cases} h_1 & i=0 \\ f_i(h_i) & 1 \le i \le l-1 \end{cases}$

The attention weights use a softmax kernel with RMSNorm on the keys to prevent magnitude bias:

$$
\alpha_{i \to l} = \frac{\phi(q_l, k_i)}{\sum_{j=0}^{l-1} \phi(q_l, k_j)}, \quad \text{where } \phi(q, k) = \exp\left(q^\top \text{RMSNorm}(k)\right)
$$

The input to layer $l$ is then: $h_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot v_i$.
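A minimal pure-Python sketch of this computation follows. The function names are hypothetical, and the RMSNorm here has no learned gain and uses an arbitrary `eps`; both are assumptions, since the summary does not specify them.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain (an assumption, not from the source)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def attn_res_weights(w, sources):
    """alpha_{i->l}: softmax over w^T RMSNorm(k_i), one weight per source."""
    logits = [sum(a * b for a, b in zip(w, rms_norm(k))) for k in sources]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attn_res_input(w, sources):
    """h_l = sum_i alpha_{i->l} * v_i, where the values equal the keys."""
    alphas = attn_res_weights(w, sources)
    return [sum(a * v[j] for a, v in zip(alphas, sources)) for j in range(len(w))]


# Toy sources: h_1 followed by two layer outputs (made-up numbers).
sources = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

uniform = attn_res_weights([0.0, 0.0], sources)  # zero query: equal weights
peaked = attn_res_weights([2.0, 0.0], sources)   # favours sources aligned with axis 0
rescaled = attn_res_weights([2.0, 0.0], [[10.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
h = attn_res_input([0.0, 0.0], sources)
```

Scaling the first source by 10 leaves the weights essentially unchanged (`rescaled` is approximately `peaked`), which is the magnitude-bias protection that normalising the keys is meant to provide.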

> **Overhead**: In standard training, Full AttnRes adds negligible memory overhead as layer outputs are already retained for backpropagation. However, under pipeline parallelism, it requires transmitting all $L$ layer outputs across stages, incurring $O(Ld)$ communication.

### Block Attention Residuals
To address the scaling challenge, layers are partitioned into $N$ blocks of size $S = L/N$.

1.  **Intra-Block Accumulation**: Within a block $n$, layer outputs are summed into a single **block representation**:
    $$b_n = \sum_{j \in \mathcal{B}_n} f_j(h_j)$$
    where $\mathcal{B}_n$ is the set of layer indices in block $n$.

2.  **Inter-Block Attention**: Each layer attends over the set of **completed block representations** $\{b_0, b_1, \ldots, b_{n-1}\}$ (where $b_0 = h_1$) and, for layers after the first in a block, also over the current **intra-block partial sum** $b_n^{i-1}$, i.e., the running sum of the outputs of the layers already computed in block $n$.

This reduces the number of representations that must be stored and communicated from $L$ to $N$, lowering memory and communication overhead from $O(Ld)$ to $O(Nd)$.
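The bookkeeping described above can be sketched as a toy forward pass. Everything here is illustrative: the function names, the toy layer, and the gain-free RMSNorm are assumptions, and the sketch does not cover how the network's final output is formed.

```python
import math


def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def attend(w, sources):
    """Softmax-weighted mix of `sources`, scored by w^T RMSNorm(source)."""
    logits = [sum(a * b for a, b in zip(w, rms_norm(s))) for s in sources]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [sum(e / total * s[j] for e, s in zip(exps, sources)) for j in range(len(w))]


def block_attn_res_forward(h1, layers, queries, block_size):
    completed = [h1]   # b_0 = h_1, then one summary per finished block
    partial = None     # running sum of layer outputs inside the current block
    max_sources = 0
    for idx, (f, w) in enumerate(zip(layers, queries)):
        sources = completed + ([partial] if partial is not None else [])
        max_sources = max(max_sources, len(sources))
        out = f(attend(w, sources))
        partial = out if partial is None else [p + o for p, o in zip(partial, out)]
        if (idx + 1) % block_size == 0:  # block finished: freeze its summary
            completed.append(partial)
            partial = None
    return completed, partial, max_sources


# Toy setup: 8 identical hypothetical layers, zero pseudo-queries, blocks of 4.
toy_layer = lambda h: [0.5 * v + 1.0 for v in h]
completed, partial, max_sources = block_attn_res_forward(
    h1=[1.0, -1.0], layers=[toy_layer] * 8, queries=[[0.0, 0.0]] * 8, block_size=4
)
print(len(completed), max_sources)
```

With 8 layers in 2 blocks, no layer ever attends over more than 3 sources, whereas Full AttnRes would present the last layer with 8.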

### Infrastructure Optimizations
The paper introduces several system-level optimizations to make Block AttnRes efficient at scale:

*   **Cross-Stage Caching for Training**: Under pipeline parallelism, previously received block representations are cached locally, eliminating redundant transmissions between virtual stages. This reduces peak per-transition communication cost from $O(C)$ to $O(P)$, where $C$ is the total number of chunks and $P$ is the number of physical stages.
*   **Two-Phase Computation for Inference**:
    *   **Phase 1 (Parallel)**: Batches all pseudo-queries $w_l$ within a block for a single, amortized attention computation over the cached block representations.
    *   **Phase 2 (Sequential)**: Computes attention over the evolving intra-block partial sum for each layer and merges the results with Phase 1 outputs using **online softmax**.
*   **Memory-Efficient Prefilling**: Block representations are sharded along the sequence dimension across tensor-parallel devices to reduce memory footprint during long-context prefilling.
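The two-phase inference scheme relies on the fact that softmax attention over two disjoint sets of sources can be computed separately and merged exactly. The sketch below illustrates that merge for a single query; the logits and values are made-up numbers, the helper names are hypothetical, and batching several pseudo-queries in Phase 1 is not shown.

```python
import math


def partial_softmax(logits, values):
    """Return (max logit, sum of exps, unnormalised weighted sum of values)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    denom = sum(exps)
    numer = [sum(e * v[j] for e, v in zip(exps, values)) for j in range(len(values[0]))]
    return m, denom, numer


def merge(state_a, state_b):
    """Online-softmax merge of two partial results over disjoint source sets."""
    m_a, den_a, num_a = state_a
    m_b, den_b, num_b = state_b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)  # rescale to a shared max
    denom = s_a * den_a + s_b * den_b
    return [(s_a * x + s_b * y) / denom for x, y in zip(num_a, num_b)]


# Phase 1: cached block representations (hypothetical logits and values).
block_logits = [0.3, -1.2, 2.0]
block_values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Phase 2: the evolving intra-block partial sum for the current layer.
partial_logit = [0.7]
partial_value = [[-1.0, 3.0]]

merged = merge(
    partial_softmax(block_logits, block_values),
    partial_softmax(partial_logit, partial_value),
)

# Reference: one softmax over all four sources at once.
_, ref_den, ref_num = partial_softmax(
    block_logits + partial_logit, block_values + partial_value
)
reference = [x / ref_den for x in ref_num]
```

`merged` and `reference` agree up to floating-point rounding, so the Phase 1 result can be computed once per block and reused while Phase 2 handles only the partial sum.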

**Pseudo-Query Initialization**: A critical detail is that all pseudo-query vectors $w_l$ must be **initialized to zero**. This ensures initial attention weights are uniform, making AttnRes equivalent to an equal-weight average at training start, which prevents training instability.
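The effect of zero initialization is easy to verify: with $w_l = 0$ every logit is $0$ regardless of the keys, so the softmax is uniform. A small sketch with arbitrary made-up keys (normalisation is omitted because a zero query makes it irrelevant):

```python
import math


def depth_softmax(w, keys):
    """Softmax over the dot products w^T k_i."""
    logits = [sum(a * b for a, b in zip(w, k)) for k in keys]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


keys = [[0.3, -1.0], [2.0, 0.5], [-0.7, 0.1], [1.5, 1.5]]  # arbitrary values
weights = depth_softmax([0.0, 0.0], keys)
print(weights)  # four equal weights of 0.25
```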

## Empirical Validation / Results
### Scaling Law Experiments
Models of five different sizes were trained (see Table 2). Both Full and Block AttnRes ($N \approx 8$) consistently outperformed the PreNorm baseline across all compute budgets.

**Table 2: Model configurations and validation loss for scaling law experiments.** The last four columns report validation loss for each variant.
| # Act. Params † | Tokens | $L_b$ | $H$ | $d_{model}$ | $d_{ff}$ | lr | Batch size ‡ | **Baseline** | **Block AttnRes** | **Full AttnRes** | **mHC(-lite)** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 194M | 38.7B | 12 | 12 | 896 | 400 | $2.99 \times 10^{-3}$ | 192 | 1.931 | 1.909 | 1.899 | 1.906 |
| 241M | 45.4B | 13 | 13 | 960 | 432 | $2.80 \times 10^{-3}$ | 256 | 1.895 | 1.875 | 1.874 | 1.869 |
| 296M | 62.1B | 14 | 14 | 1024 | 464 | $2.50 \times 10^{-3}$ | 320 | 1.829 | 1.809 | 1.804 | 1.807 |
| 436M | 87.9B | 16 | 16 | 1168 | 528 | $2.20 \times 10^{-3}$ | 384 | 1.766 | 1.746 | 1.737 | 1.747 |
| 528M | 119.0B | 17 | 17 | 1264 | 560 | $2.02 \times 10^{-3}$ | 432 | 1.719 | 1.693 | 1.692 | 1.694 |
*† Activated parameters in MoE models. ‡ Context length = 8192. $L_b = L/2$ = number of Transformer blocks.*

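*   **Fitted Scaling Curves** (validation loss $\mathcal{L}$ as a function of training compute $C$ in PFLOP/s-days):
    *   Baseline: $\mathcal{L} = 1.891 \times C^{-0.057}$
    *   Block AttnRes: $\mathcal{L} = 1.870 \times C^{-0.058}$
    *   Full AttnRes: $\mathcal{L} = 1.865 \times C^{-0.057}$
*   **Compute Equivalence**: At 5.6 PFLOP/s-days, the fitted curves give a loss of 1.692 for Block AttnRes vs. 1.714 for the baseline; the baseline would need roughly **1.25x** the compute to reach the same loss.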
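The compute-equivalence figure can be reproduced from the fitted coefficients listed above by evaluating both curves at 5.6 PFLOP/s-days and then inverting the baseline curve. The helper names below are hypothetical; the coefficients are the ones quoted in this section.

```python
def fitted_loss(a, b, compute):
    """Power-law fit: loss = a * compute^(-b)."""
    return a * compute ** (-b)


def compute_for_loss(a, b, target_loss):
    """Invert the fit: compute = (a / loss)^(1 / b)."""
    return (a / target_loss) ** (1.0 / b)


budget = 5.6  # PFLOP/s-days

baseline_loss = fitted_loss(1.891, 0.057, budget)
block_loss = fitted_loss(1.870, 0.058, budget)

# Compute the baseline would need to reach the Block AttnRes loss.
baseline_equiv = compute_for_loss(1.891, 0.057, block_loss)
advantage = baseline_equiv / budget

print(round(baseline_loss, 3), round(block_loss, 3), round(advantage, 2))
```

The two losses round to 1.714 and 1.692, and the ratio comes out at about 1.25, matching the reported compute advantage.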

### Main Results on 48B Model
A 48B total parameter (3B activated) Kimi Linear MoE model was pre-trained on 1.4T tokens with Block AttnRes ($N=9$ blocks).

**Training Dynamics Analysis** (Fig. 5):
*   **Validation Loss**: AttnRes achieved consistently lower loss.
*   **Output Magnitude**: The baseline showed unbounded growth with depth (PreNorm dilution), while AttnRes confined growth within blocks, producing a bounded periodic pattern.
*   **Gradient Magnitude**: The baseline had disproportionately large gradients in early layers. AttnRes led to a **more uniform gradient distribution** across depth.
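The contrast in output magnitude follows from a simple property: a unit-weight sum can grow with the number of terms, while a softmax-weighted (convex) combination can never exceed the largest source norm. The toy vectors and weights below are made up for illustration and are not measurements from the paper; in Block AttnRes the sources are block sums, so some growth inside each block remains.

```python
import math


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def weighted_sum(weights, vectors):
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(len(vectors[0]))]


# Four unit-norm toy "layer outputs".
outputs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6]]

residual_stream = weighted_sum([1.0, 1.0, 1.0, 1.0], outputs)  # standard residual
attn_mix = weighted_sum([0.1, 0.2, 0.3, 0.4], outputs)         # hypothetical softmax weights

print(norm(residual_stream), norm(attn_mix))
```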

**Table 3: Downstream performance comparison after pre-training.**
| Benchmark | **Baseline** | **AttnRes** |
| :--- | :--- | :--- |
| **General** | | |
| MMLU | 73.5 | **74.6** |
| MMLU-Pro | 52.2 | 52.2 |
| GPQA-Diamond | 36.9 | **44.4** |
| BBH | 76.3 | **78.0** |
| ARC-Challenge | 64.6 | **65.7** |
| HellaSwag | 83.2 | **83.4** |
| TriviaQA | 69.9 | **71.8** |
| **Math & Code** | | |
| GSM8K | 81.7 | **82.4** |
| MGSM | 64.9 | **66.1** |
| Math | 53.5 | **57.1** |
| CMath | 84.7 | **85.1** |
| HumanEval | 59.1 | **62.2** |
| MBPP | 72.0 | **73.9** |
| **Chinese** | | |
| CMMLU | 82.0 | **82.9** |
| C-Eval | 79.6 | **82.5** |

AttnRes matched or outperformed the baseline on **all 15 benchmarks**, with particularly strong gains on multi-step reasoning and code generation tasks.
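The per-benchmark gains can be tallied directly from Table 3. The snippet below copies the scores from the table and counts improvements and ties; the variable names are arbitrary.

```python
# (baseline, AttnRes) scores copied from Table 3.
scores = {
    "MMLU": (73.5, 74.6), "MMLU-Pro": (52.2, 52.2), "GPQA-Diamond": (36.9, 44.4),
    "BBH": (76.3, 78.0), "ARC-Challenge": (64.6, 65.7), "HellaSwag": (83.2, 83.4),
    "TriviaQA": (69.9, 71.8), "GSM8K": (81.7, 82.4), "MGSM": (64.9, 66.1),
    "Math": (53.5, 57.1), "CMath": (84.7, 85.1), "HumanEval": (59.1, 62.2),
    "MBPP": (72.0, 73.9), "CMMLU": (82.0, 82.9), "C-Eval": (79.6, 82.5),
}

gains = {name: round(attn - base, 1) for name, (base, attn) in scores.items()}
improved = sum(g > 0 for g in gains.values())
tied = sum(g == 0 for g in gains.values())
worse = sum(g < 0 for g in gains.values())
best = max(gains, key=gains.get)

print(improved, tied, worse, best, gains[best])
```

This gives 14 improvements, one tie (MMLU-Pro), and no regressions, with the largest gain on GPQA-Diamond (+7.5).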

### Ablation Studies
Key findings from component ablations on a 16-layer model (Table 4):

*   **Importance of Input-Dependence**: DenseFormer (static weights) showed no gain (1.767 vs. baseline 1.766), while AttnRes (dynamic) achieved 1.737.
*   **Block Size Trade-off**: Performance degrades gracefully as block size $S$ increases (Fig. 6). $S=4$ (1.746) recovers most of the gain of Full AttnRes (1.737).
*   **Mechanism Design**:
    *   **Softmax vs. Sigmoid**: Softmax (competitive normalization) performed better (1.737 vs. 1.741).
    *   **RMSNorm on Keys**: Crucial for performance; prevents magnitude bias (1.737 vs. 1.743 w/o RMSNorm).
    *   **Multi-Head Attention**: Per-head depth aggregation hurt performance (1.752 vs. 1.746), suggesting that the optimal depth-wise mixture is largely uniform across channels.

### Analysis of Learned Patterns
Visualization of depth-wise attention weights $\alpha_{i \to l}$ (Fig. 8) revealed:
*   **Preserved Locality**: Each layer attends most strongly to its immediate predecessor (diagonal dominance).
*   **Learned Skip Connections**: Selective off-diagonal concentrations emerge, e.g., layer 4 attending to early sources.
*   **Layer Specialization**: The embedding $h_1$ retains non-trivial weight throughout, especially before attention layers. Pre-MLP layers show sharper reliance on recent representations.
*   **Block Structure Preservation**: Block AttnRes maintains the essential patterns (diagonal dominance, embedding persistence) of Full AttnRes.

## Theoretical and Practical Implications
### Theoretical Insight: Residual Connections as Structured Matrices
The paper provides a unified view by framing residual variants via a **depth mixing matrix** $M \in \mathbb{R}^{L \times L}$, where $M_{i \to l}$ is the weight layer $l$ assigns to the output of layer $i$. The input is $h_l = \sum_{i=0}^{l-1} M_{i \to l} v_i$.

*   **Standard Residuals**: $M$ is an all-ones lower-triangular matrix (rank-1 semiseparable).
*   **Highway Networks**: $M$ is 1-semiseparable with input-dependent weights.
*   **(m)HC**: $M_{i \to l} = \beta_i^\top A_{i+1 \to l}^\times \alpha_l$, making $M$ $m$-semiseparable. This corresponds to **depth-wise linear attention**.
*   **Full AttnRes**: $M$ is a dense matrix of softmax attention scores, corresponding to **depth-wise softmax attention**.

This perspective shows AttnRes completes the transition from linear to softmax attention over depth, mirroring the transformative shift that occurred over sequences.
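To make the matrix view concrete, the sketch below builds $M$ for standard residuals and for AttnRes at initialization (zero pseudo-queries, hence uniform weights over the $l$ available sources). Rows are indexed by the receiving layer $l = 1, \ldots, L$ and columns by the source $i = 0, \ldots, L-1$; this layout and the function names are choices made for the illustration.

```python
def standard_residual_matrix(L):
    """All-ones lower-triangular mixing: every earlier source has weight 1."""
    return [[1.0 if i < l else 0.0 for i in range(L)] for l in range(1, L + 1)]


def uniform_attn_res_matrix(L):
    """AttnRes with zero pseudo-queries: each row is a uniform softmax."""
    return [[1.0 / l if i < l else 0.0 for i in range(L)] for l in range(1, L + 1)]


L = 4
M_std = standard_residual_matrix(L)
M_attn = uniform_attn_res_matrix(L)

std_row_sums = [sum(row) for row in M_std]    # grows with depth: 1, 2, 3, 4
attn_row_sums = [sum(row) for row in M_attn]  # each row sums to 1

for row in M_attn:
    print([round(x, 3) for x in row])
```

After training, the AttnRes rows stay normalised but are no longer uniform; each row becomes an input-dependent softmax distribution over earlier sources.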

### Practical Implications
*   **Mitigates PreNorm Dilution**: By allowing selective aggregation, AttnRes bounds hidden-state growth and leads to more uniform gradient flow.
*   **Enables Selective Information Retrieval**: Later layers can directly access and emphasize useful representations from any earlier layer, which is particularly beneficial for multi-step reasoning.
*   **Scalable and Efficient**: Block AttnRes, with its system optimizations, serves as a practical drop-in replacement for standard residuals, with:
    *   **< 4%** training overhead under pipeline parallelism.
    *   **< 2%** inference latency overhead on typical workloads.
*   **Architectural Preference**: Architecture sweeps suggest AttnRes allows models to **benefit more effectively from increased depth** compared to standard residuals.

## Conclusion
Attention Residuals (AttnRes) rethinks the fundamental residual connection by introducing learned, input-dependent attention over depth. It addresses key limitations of standard residuals, namely fixed aggregation and the PreNorm dilution problem. The Block AttnRes variant makes the approach scalable for large-model training with minimal overhead. Empirical results demonstrate consistent improvements across model scales and downstream tasks. The work draws a formal duality between sequence and depth, positioning AttnRes as the depth-wise analog of the softmax attention that revolutionized sequence modeling.

**Future Directions**: As hardware constraints relax, exploring finer-grained block sizes or Full AttnRes is a natural path. Incorporating more memory-efficient (e.g., linear-complexity) attention alternatives over depth is also a promising research direction.

---

_Markdown view of https://picx.dev/p/2znetu, served by PicX — AI-generated visual whiteboard summaries of research papers._