# Attention Is All You Need

> The Transformer architecture replaces recurrence and convolution with multi-head self-attention, achieving superior parallelization and state-of-the-art translation performance.

- **Source:** [arXiv](https://arxiv.org/abs/1706.03762)
- **Published:** 2026-03-07
- **Permalink:** https://picx.dev/p/X7eYGU
- **Whiteboard:** https://picx.dev/p/X7eYGU/image

## Summary

*   **Proposes the Transformer**, a novel neural network architecture for sequence transduction based entirely on attention mechanisms, dispensing with recurrence and convolution.
*   **Introduces Multi-Head Attention**, which allows the model to jointly attend to information from different representation subspaces at different positions.
*   **Achieves new state-of-the-art results** on WMT 2014 English-to-German (28.4 BLEU) and English-to-French (41.8 BLEU) translation tasks, with significantly faster training times.
*   **Demonstrates superior parallelization** and shorter path lengths for long-range dependencies compared to recurrent and convolutional networks.
*   **Shows strong generalization** to other tasks, achieving competitive results on English constituency parsing with both limited and large training data.

## Introduction and Theoretical Foundation
The dominant sequence transduction models (e.g., for machine translation) were based on complex **Recurrent Neural Networks (RNNs)** or **Convolutional Neural Networks (CNNs)** in an encoder-decoder framework, often enhanced with attention mechanisms. While effective, these architectures have inherent limitations:
*   **RNNs** process sequences sequentially, which precludes parallelization within training examples and becomes a bottleneck for long sequences.
*   **CNNs** require a stack of layers to connect distant positions ($O(n/k)$ layers with contiguous kernels of width $k$, or $O(\log_k(n))$ with dilated convolutions), increasing the path length for dependencies.

**Attention mechanisms** had become crucial for modeling dependencies regardless of distance but were almost exclusively used in conjunction with recurrent layers. The paper's core thesis is that **attention mechanisms alone are sufficient** for building a powerful sequence model. The proposed **Transformer** architecture eliminates recurrence, relying entirely on attention to draw global dependencies between input and output. This allows significantly more parallelization and shortens the paths along which long-range dependencies must be learned.

## Methodology

### Model Architecture
The Transformer follows the encoder-decoder structure. The encoder maps an input sequence $(x_1, ..., x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, ..., z_n)$. The decoder then generates an output sequence $(y_1, ..., y_m)$ auto-regressively, consuming previously generated symbols.
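The auto-regressive loop described above can be sketched as follows. This is a minimal illustration, not the paper's decoding procedure: it uses greedy selection for simplicity, and `toy_encode` and `toy_decode_step` are hypothetical stand-ins for the real encoder and decoder stacks.

```python
def greedy_decode(encode, decode_step, source, bos, eos, max_len=20):
    """Generate output symbols one at a time, feeding back earlier outputs."""
    memory = encode(source)      # z = (z_1, ..., z_n), computed once
    output = [bos]
    for _ in range(max_len):
        token = decode_step(memory, output)  # depends only on earlier outputs
        output.append(token)
        if token == eos:
            break
    return output[1:]            # drop the start symbol

# Hypothetical stand-ins: the "model" copies the source, then emits eos.
def toy_encode(source):
    return list(source)

def toy_decode_step(memory, generated):
    idx = len(generated) - 1     # number of real tokens produced so far
    return memory[idx] if idx < len(memory) else "</s>"

decoded = greedy_decode(toy_encode, toy_decode_step, ["a", "b", "c"], "<s>", "</s>")
print(decoded)
```

The encoder runs once per input, while the decoder step runs once per generated symbol, which is why generation stays sequential even though training is parallel.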

**Encoder:** A stack of $N = 6$ identical layers. Each layer has two sub-layers:
1.  A **multi-head self-attention mechanism**.
2.  A simple, **position-wise fully connected feed-forward network**.

A residual connection is employed around each sub-layer, followed by layer normalization: $\text{LayerNorm}(x + \text{Sublayer}(x))$. To make the residual additions possible, all sub-layers and the embedding layers produce outputs of dimension $d_{\text{model}} = 512$.
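A minimal sketch of the $\text{LayerNorm}(x + \text{Sublayer}(x))$ wrapper for a single position vector is shown below. It assumes a layer normalization without learned gain and bias, and `toy_sublayer` is a hypothetical placeholder for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical stand-in for a real sub-layer.
def toy_sublayer(x):
    return [0.5 * v for v in x]

out = residual_sublayer([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```

The sub-layer must return a vector of the same length as its input, which is why every sub-layer shares the dimension $d_{\text{model}}$.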

**Decoder:** Also a stack of $N = 6$ identical layers. It has three sub-layers per layer:
1.  A **masked multi-head self-attention** layer (to prevent looking ahead).
2.  A **multi-head attention layer** over the encoder's output.
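3.  A **position-wise feed-forward network**.

Residual connections and layer normalization are applied around each sub-layer, as in the encoder. The masking, combined with offsetting the output embeddings by one position, ensures that the prediction for position $i$ depends only on the known outputs at positions before $i$, which preserves the auto-regressive property.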

### Attention Mechanism
The core innovation is the **Scaled Dot-Product Attention** and its extension to **Multi-Head Attention**.

**Scaled Dot-Product Attention:**
The input consists of queries ($Q$) and keys ($K$) of dimension $d_k$, and values ($V$) of dimension $d_v$. The attention is computed as:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

> **Key Insight:** The scaling factor $\frac{1}{\sqrt{d_k}}$ prevents the dot products from growing large in magnitude, which would push the softmax into regions of extremely small gradients.
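The attention formula can be sketched in plain Python as follows. This is an illustrative, unoptimized version that stores matrices as lists of rows; the toy `Q`, `K`, and `V` values are assumptions chosen only to make the behavior visible.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V, one output row per query."""
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        # Dot product of this query with every key, scaled by 1/sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = scaled_dot_product_attention(Q, K, V)
print(result)
```

Each query puts more weight on the key it matches, so each output row leans toward the corresponding value vector without ignoring the other one.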

**Multi-Head Attention:**
Instead of one attention function, the model linearly projects the queries, keys, and values $h$ times with different learned projections. Attention is performed in parallel on these projected versions, and the outputs are concatenated and projected again.

$$
\begin{aligned}
\text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O \\
\text{where head}_i &= \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\end{aligned}
$$

Where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$.
The paper uses $h=8$ heads with $d_k = d_v = d_{\text{model}}/h = 64$.
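The sketch below shows how the per-head projections, concatenation, and output projection fit together. It uses toy sizes ($d_{\text{model}}=8$, $h=2$) and random weights from a hypothetical `random_matrix` helper, so it only demonstrates the shapes, not a trained model. The last line counts the projection weights for the paper's sizes, assuming no bias terms.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(Q, K, V, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) projection triples, one per head."""
    outputs = [attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
               for W_q, W_k, W_v in heads]
    # Concatenate the per-head outputs along the feature axis, position by position.
    concat = [[x for head in outputs for x in head[pos]] for pos in range(len(Q))]
    return matmul(concat, W_O)

def random_matrix(rows, cols, rng):
    """Hypothetical initializer; real models learn these weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, h = 8, 2                 # toy sizes; the paper uses d_model=512, h=8
d_k = d_v = d_model // h
heads = [(random_matrix(d_model, d_k, rng),
          random_matrix(d_model, d_k, rng),
          random_matrix(d_model, d_v, rng)) for _ in range(h)]
W_O = random_matrix(h * d_v, d_model, rng)
X = random_matrix(3, d_model, rng)       # a sequence of 3 positions
out = multi_head(X, X, X, heads, W_O)    # self-attention: Q = K = V = X

# Projection weights for the paper's sizes: h * 3 * (512 x 64) plus W^O (512 x 512).
paper_projection_params = 8 * 3 * 512 * 64 + (8 * 64) * 512
print(len(out), len(out[0]), paper_projection_params)
```

Because each head works in a subspace of size $d_{\text{model}}/h$, the total projection cost matches that of single-head attention with full dimensionality.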

**Applications in the Model:**
*   **Encoder Self-Attention:** All keys, values, queries come from the previous encoder layer.
*   **Masked Decoder Self-Attention:** Allows each position to attend only to earlier positions.
*   **Encoder-Decoder Attention:** Queries from decoder, keys and values from encoder output.
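The masking used in decoder self-attention can be illustrated with a small sketch. It assumes the common approach of setting disallowed scores to negative infinity before the softmax; the uniform `scores` matrix is a made-up input chosen so the effect of the mask is easy to read.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row, with disallowed entries forced to zero weight."""
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because each position may attend to itself
    exps = [math.exp(v - m) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
scores = [[0.0] * n for _ in range(n)]   # illustrative: all scores equal
mask = causal_mask(n)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
for row in weights:
    print(row)
```

With equal scores, position $i$ spreads its weight uniformly over positions $0$ through $i$ and assigns zero weight to everything after it.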

### Position-wise Feed-Forward Networks
Each encoder and decoder layer also contains a fully connected feed-forward network, applied to each position separately and identically:

$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$

The input and output dimensionality is $d_{\text{model}}=512$, and the inner layer has dimensionality $d_{ff}=2048$.
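A minimal sketch of the feed-forward formula for one position vector follows. The tiny weight matrices are made-up values for illustration (2 inputs, 3 hidden units), and the final line counts parameters for the paper's sizes including both bias vectors.

```python
def ffn(x, W1, b1, W2, b2):
    """Compute max(0, x W1 + b1) W2 + b2 for one position vector x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Illustrative weights: W1 is (d_model x d_ff) = 2 x 3, W2 is (d_ff x d_model) = 3 x 2.
W1 = [[1.0, -1.0, 0.5],
      [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.1, -0.1]

sequence = [[2.0, 1.0], [0.0, 3.0]]
# The same weights are applied independently at every position.
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
print(outputs)

# Parameter count for d_model = 512 and d_ff = 2048, including biases.
ffn_params = 512 * 2048 + 2048 + 2048 * 512 + 512
print(ffn_params)
```

Since no position's output depends on another position's input, this sub-layer is trivially parallel across the sequence.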

### Positional Encoding
Since the model contains no recurrence or convolution, **positional encodings** are added to the input embeddings to inject information about token order. The paper uses sinusoidal functions:

$$
\begin{aligned}
PE_{(pos, 2i)} &= \sin(pos / 10000^{2i/d_{\text{model}}}) \\
PE_{(pos, 2i+1)} &= \cos(pos / 10000^{2i/d_{\text{model}}})
\end{aligned}
$$

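Where $pos$ is the position and $i$ indexes a sine/cosine pair of dimensions, so each pair corresponds to a sinusoid whose wavelength grows geometrically from $2\pi$ to $10000 \cdot 2\pi$. The authors hypothesize that this may allow the model to extrapolate to sequence lengths longer than those seen during training.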
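The sinusoidal encoding can be computed directly from the two formulas. In the sketch below, the loop variable `two_i` plays the role of $2i$ in the equations, and the small `max_len` and `d_model` values are chosen only for readability.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for two_i in range(0, d_model, 2):
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)          # even dimension
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)  # odd dimension
    return pe

pe = positional_encoding(max_len=3, d_model=4)
for row in pe:
    print(row)
```

The resulting table is added element-wise to the token embeddings, so it must share their dimension $d_{\text{model}}$.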

### Why Self-Attention? A Comparative Analysis
The paper compares self-attention layers to recurrent and convolutional layers on three key metrics (see Table 1):

**Table 1: Complexity Comparison of Layer Types**
| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| :--- | :--- | :--- | :--- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |
| Self-Attention (restricted) | $O(r \cdot n \cdot d)$ | $O(1)$ | $O(n/r)$ |

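*   **Computational Complexity:** Self-attention costs $O(n^2 \cdot d)$ per layer versus $O(n \cdot d^2)$ for a recurrent layer, so it is cheaper whenever the sequence length $n$ is smaller than the representation dimension $d$, which is typical for sentence-level translation.
*   **Parallelism:** Self-attention requires a constant number of sequential operations ($O(1)$), unlike RNNs ($O(n)$).
*   **Path Length:** Self-attention creates direct connections between any two positions in the sequence with a path length of $O(1)$, making it easier to learn long-range dependencies compared to RNNs ($O(n)$) or CNNs ($O(\log_k(n))$).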
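To make the per-layer complexity column of Table 1 concrete, the sketch below evaluates the leading terms for a few sequence lengths. These are asymptotic terms with constants dropped, not measured operation counts, and the values of `n`, `k`, and `r` are illustrative assumptions.

```python
def self_attention_ops(n, d):
    return n * n * d

def recurrent_ops(n, d):
    return n * d * d

def convolutional_ops(n, d, k):
    return k * n * d * d

def restricted_attention_ops(n, d, r):
    return r * n * d

d = 512
for n in (50, 512, 4096):
    print(n,
          self_attention_ops(n, d),
          recurrent_ops(n, d),
          convolutional_ops(n, d, k=3),
          restricted_attention_ops(n, d, r=64))
```

For a sentence-length input such as $n = 50$, the self-attention term is roughly ten times smaller than the recurrent term; the two are equal at $n = d$, and self-attention becomes the more expensive of the two beyond that point.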

## Empirical Validation / Results

### Machine Translation
The Transformer was evaluated on standard WMT 2014 translation tasks.

**Table 2: Translation Results and Training Cost**
| Model | BLEU (EN-DE) | BLEU (EN-FR) | Training Cost, EN-DE (FLOPs) | Training Cost, EN-FR (FLOPs) |
| :--- | :--- | :--- | :--- | :--- |
| **Previous SOTA (Ensembles)** | | | | |
| GNMT + RL Ensemble [38] | 26.30 | 41.16 | $1.8 \cdot 10^{20}$ | $1.1 \cdot 10^{21}$ |
| ConvS2S Ensemble [9] | 26.36 | 41.29 | $7.7 \cdot 10^{19}$ | $1.2 \cdot 10^{21}$ |
| **Transformer (this work)** | | | | |
| Base Model | 27.3 | 38.1 | $\mathbf{3.3 \cdot 10^{18}}$ | - |
| **Big Model** | **28.4** | **41.8** | $2.3 \cdot 10^{19}$ | - |

*   **English-to-German:** The "big" Transformer model achieved a **BLEU score of 28.4**, improving over the best previous model (including ensembles) by over **2.0 BLEU**.
*   **English-to-French:** The "big" model achieved a **BLEU score of 41.8**, establishing a new single-model state-of-the-art after training for **3.5 days on 8 GPUs**, a fraction of the training cost of the previous best models.
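The training-cost figures in Table 2 can be reproduced approximately from training time and hardware. The sketch below follows the paper's estimation approach (training time times number of GPUs times sustained per-GPU throughput) and assumes the paper's figure of 9.5 TFLOPS sustained for a P100 and roughly 12 hours of training for the base model; neither number appears in the summary above.

```python
SECONDS_PER_DAY = 86400

def training_flops(days, num_gpus, sustained_tflops):
    """Estimate total training FLOPs as time x GPUs x sustained throughput."""
    return days * SECONDS_PER_DAY * num_gpus * sustained_tflops * 1e12

big_model = training_flops(days=3.5, num_gpus=8, sustained_tflops=9.5)
base_model = training_flops(days=0.5, num_gpus=8, sustained_tflops=9.5)

print(f"big:  {big_model:.2e} FLOPs")
print(f"base: {base_model:.2e} FLOPs")
```

Both estimates land within about one percent of the table's $2.3 \cdot 10^{19}$ and $3.3 \cdot 10^{18}$.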

### Model Ablation Studies
Experiments on the English-German development set (newstest2013) analyzed the impact of various components (see Table 3 for full details).

**Table 3: Model Variations (Selected Highlights)**
| Change | PPL (dev) | BLEU (dev) | Key Finding |
| :--- | :--- | :--- | :--- |
| **(A) Number of Attention Heads ($h$)** | | | |
| $h=1$ | 5.29 | 24.9 | Single-head attention is 0.9 BLEU worse than best. |
| $h=8$ (base) | 4.92 | 25.8 | Optimal performance. |
| $h=32$ | 5.01 | 25.4 | Quality drops with too many heads. |
| **(C) Model Size ($d_{model}$, $d_{ff}$)** | | | |
| $d_{model}=256$ | 5.75 | 24.5 | Smaller model, worse performance. |
| $d_{ff}=4096$ | 4.75 | 26.2 | Bigger models are better. |
| **(E) Positional Encoding** | | | |
| Learned embeddings | 4.92 | 25.7 | Similar results to sinusoidal encoding. |

Key findings from ablation:
*   Multi-head attention matters: single-head attention is 0.9 BLEU worse than the best setting, and quality also drops with too many heads.
*   Larger models and the use of dropout ($P_{drop}=0.1$) consistently improve performance.
*   Sinusoidal and learned positional encodings yield nearly identical results.

### English Constituency Parsing
To test generalization, a 4-layer Transformer ($d_{model}=1024$) was applied to English constituency parsing on the WSJ Penn Treebank.

**Table 4: English Constituency Parsing Results (F1 on WSJ Section 23)**
| Parser | Training | F1 |
| :--- | :--- | :--- |
| **WSJ Only (~40K sentences)** | | |
| Dyer et al. (2016) [8] | WSJ only | 91.7 |
| **Transformer (4 layers)** | **WSJ only** | **91.3** |
| **Semi-supervised (~17M sentences)** | | |
| Vinyals & Kaiser et al. (2014) [37] | semi-supervised | 92.1 |
| **Transformer (4 layers)** | **semi-supervised** | **92.7** |

The Transformer achieved strong results **without task-specific architecture changes**, outperforming all previously reported models except the Recurrent Neural Network Grammar [8]. Unlike RNN sequence-to-sequence models [37], it also outperforms the BerkeleyParser even when trained only on the small WSJ set of about 40K sentences.

## Theoretical and Practical Implications

**Theoretical Implications:**
*   **Challenges the Necessity of Recurrence:** Demonstrates that sequential computation is not a fundamental requirement for powerful sequence modeling. Self-attention provides a compelling alternative that offers direct, constant-length paths between any sequence positions.
*   **Re-frames Attention:** Elevates attention from a supplementary mechanism to the **primary building block** of a state-of-the-art architecture.

**Practical Implications:**
*   **Parallelization:** Without recurrence, all positions are processed in parallel during training, which shortens wall-clock training time; the big model reached state-of-the-art quality after 3.5 days on 8 GPUs.
*   **Scalability:** The number of sequential operations per layer stays constant as sequences grow, although the $O(n^2 \cdot d)$ per-layer cost means very long inputs may call for restricted attention.
*   **General-Purpose Architecture:** The Transformer's success on both translation and parsing suggests it is a versatile, general-purpose sequence modeling architecture, paving the way for its application across NLP and beyond.

## Conclusion
The Transformer is the first sequence transduction model based **entirely on attention mechanisms**, replacing the recurrent layers standard in encoder-decoder models. It achieves superior translation quality while being significantly more parallelizable and requiring less time to train. The model's strong performance on a syntactic task (parsing) further indicates its generality.

**Future Directions** outlined include applying attention-based models to tasks with other input/output modalities (images, audio, video), investigating local/restricted attention for large inputs, and making the generation process less sequential. The Transformer architecture established a new paradigm that would become foundational for subsequent models like BERT, GPT, and their successors, revolutionizing the field of natural language processing.

---

_Markdown view of https://picx.dev/p/X7eYGU, served by PicX — AI-generated visual whiteboard summaries of research papers._