Summary of "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"

Summary (Overview)

  • Core Contribution: Introduces the Vision Transformer (ViT), a model that applies a standard Transformer encoder directly to sequences of image patches for image classification, demonstrating that reliance on convolutional neural networks (CNNs) is not necessary for state-of-the-art performance.
  • Key Finding: Large-scale pre-training is crucial. While ViT underperforms comparable ResNets on smaller datasets (e.g., ImageNet-1k), it matches or surpasses state-of-the-art CNNs when pre-trained on very large datasets (14M-300M images), showing that "large scale training trumps inductive bias."
  • Efficiency: ViT attains excellent results while requiring substantially fewer computational resources for pre-training compared to leading convolutional models like BiT and Noisy Student.
  • Architecture Simplicity: ViT uses a minimally modified Transformer architecture. An image is split into fixed-size patches, linearly embedded, added with position embeddings, and processed by a standard Transformer encoder. A learnable [class] token is used for classification.

Introduction and Theoretical Foundation

The Transformer architecture, dominant in Natural Language Processing (NLP), had seen limited application in computer vision, where attention was typically used in conjunction with or to replace parts of convolutional networks. This paper challenges the necessity of CNNs by proposing a pure Transformer model for vision.

The motivation stems from the scaling successes of Transformers in NLP, where pre-training on large corpora followed by task-specific fine-tuning leads to unprecedented performance. The authors hypothesize that a similar approach could work for images by treating them as sequences of patches, analogous to word tokens in NLP.

The key theoretical shift is the reduction of image-specific inductive bias. Unlike CNNs, which inherently bake in properties like locality, 2D neighborhood structure, and translation equivariance, ViT has minimal such biases:

  • Only the initial patch splitting and MLP layers are local and translationally equivariant.
  • Self-attention layers are global.
  • Spatial relations must be learned entirely from data.

The paper posits that with sufficient data, learning relevant patterns directly is more beneficial than hard-coding architectural priors.

Methodology

Vision Transformer (ViT) Architecture

The model follows the original Transformer (Vaswani et al., 2017) closely. The process is depicted in Figure 1 from the paper.

  1. Patch Embedding: An image $x \in \mathbb{R}^{H \times W \times C}$ is reshaped into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the image resolution, $C$ is the number of channels, $(P, P)$ is the patch resolution, and $N = HW / P^2$ is the resulting sequence length. These patches are linearly projected to a $D$-dimensional space using a trainable matrix $E$.

    $$\mathbf{z}_0 = [x_{\text{class}};\, x_p^1 E;\, x_p^2 E;\, \cdots;\, x_p^N E] + \mathbf{E}_{\text{pos}}, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\ \mathbf{E}_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$$
  2. Classification Token: Similar to BERT's [CLS] token, a learnable embedding $x_{\text{class}}$ is prepended to the sequence. The state of this token at the encoder output ($\mathbf{z}_L^0$) serves as the image representation $y$.

  3. Position Embeddings: Standard learnable 1D position embeddings $\mathbf{E}_{\text{pos}}$ are added to retain positional information. The authors found no significant gain from more advanced 2D-aware embeddings.

  4. Transformer Encoder: The sequence is processed by an encoder of $L$ layers, each consisting of Multiheaded Self-Attention (MSA) and MLP blocks, with LayerNorm (LN) applied before each block and residual connections after.

    $$\mathbf{z}'_\ell = \operatorname{MSA}(\operatorname{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, \quad \ell = 1 \ldots L$$
    $$\mathbf{z}_\ell = \operatorname{MLP}(\operatorname{LN}(\mathbf{z}'_\ell)) + \mathbf{z}'_\ell, \quad \ell = 1 \ldots L$$
    $$y = \operatorname{LN}(\mathbf{z}_L^0)$$
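The four steps above can be sketched end to end in NumPy. This is a minimal illustration of the data flow only, not the paper's implementation: weights are random, the Q/K/V projections are replaced by identity maps, and GELU is approximated by ReLU for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm over the feature dimension (learned scale/shift omitted).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(z, num_heads):
    # Simplified multi-head self-attention with identity Q/K/V projections.
    N, D = z.shape
    h = D // num_heads
    out = np.zeros_like(z)
    for i in range(num_heads):
        q = k = v = z[:, i * h:(i + 1) * h]
        a = q @ k.T / np.sqrt(h)
        a = np.exp(a - a.max(-1, keepdims=True))
        a /= a.sum(-1, keepdims=True)
        out[:, i * h:(i + 1) * h] = a @ v
    return out

def vit_forward(img, P=4, D=32, L=2, heads=4):
    rng = np.random.default_rng(0)
    H, W, C = img.shape
    N = (H // P) * (W // P)
    # 1. Split into P x P patches; flatten each to a (P^2 * C)-vector.
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(N, P * P * C)
    E = rng.standard_normal((P * P * C, D)) * 0.02      # patch projection E
    cls = rng.standard_normal((1, D)) * 0.02            # learnable [class] token
    E_pos = rng.standard_normal((N + 1, D)) * 0.02      # 1D position embeddings
    z = np.concatenate([cls, patches @ E], axis=0) + E_pos
    # 2-4. Pre-LN Transformer encoder blocks with residual connections.
    for _ in range(L):
        z = z + attention(layer_norm(z), heads)
        W1 = rng.standard_normal((D, 4 * D)) * 0.02
        W2 = rng.standard_normal((4 * D, D)) * 0.02
        z = z + np.maximum(layer_norm(z) @ W1, 0) @ W2  # MLP (ReLU for GELU)
    return layer_norm(z[0])                             # y = LN(z_L^0)

y = vit_forward(np.ones((16, 16, 3)))
print(y.shape)  # (32,)
```

Note that the only image-specific step is the initial patchification; everything after it is a standard Transformer encoder.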

Model Variants and Hybrids

Three model sizes based on BERT configurations were used:

| Model | Layers | Hidden Size $D$ | MLP Size | Heads | Params |
|---|---|---|---|---|---|
| ViT-Base | 12 | 768 | 3072 | 12 | 86M |
| ViT-Large | 24 | 1024 | 4096 | 16 | 307M |
| ViT-Huge | 32 | 1280 | 5120 | 16 | 632M |
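The parameter counts in the table follow almost entirely from the encoder configuration. As a rough sanity check (an approximation, not the paper's exact accounting): each layer has about $4D^2$ attention weights (Q, K, V, and output projections) plus $8D^2$ MLP weights ($D \to 4D \to D$), so roughly $12LD^2$ in total, with embeddings and biases making up the small remainder.

```python
# Back-of-the-envelope check of the parameter counts in the table above:
# ~4*D^2 attention weights + ~8*D^2 MLP weights per layer -> ~12 * L * D^2.
for name, L, D, paper in [("ViT-Base", 12, 768, "86M"),
                          ("ViT-Large", 24, 1024, "307M"),
                          ("ViT-Huge", 32, 1280, "632M")]:
    approx = 12 * L * D ** 2
    print(f"{name}: ~{approx / 1e6:.0f}M (paper: {paper})")
```

The estimates (~85M, ~302M, ~629M) land within a few percent of the reported totals.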

Hybrid Architecture: As an alternative to raw patches, the input sequence can be formed from the feature maps of a CNN backbone. The patch embedding projection is then applied to these CNN-derived features.
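As a sketch of the hybrid input path (shapes chosen for illustration, e.g. a 14×14×1024 feature map as a ResNet stage might produce): the CNN feature map is flattened into a token sequence and the same linear projection is applied, with a "patch size" of effectively 1×1 on the feature grid.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical CNN backbone output: spatial grid H' x W' with C' channels.
feature_map = rng.standard_normal((14, 14, 1024))
Hp, Wp, Cp = feature_map.shape
tokens = feature_map.reshape(Hp * Wp, Cp)   # one token per spatial location
E = rng.standard_normal((Cp, 768)) * 0.02   # patch embedding projection to D
z = tokens @ E
print(z.shape)  # (196, 768)
```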

Training and Fine-tuning

  • Pre-training: Models are trained on large datasets (ImageNet-21k, JFT-300M) using Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$) with a batch size of 4096 and a high weight decay of 0.1.
  • Fine-tuning: Models are fine-tuned on downstream tasks using SGD with momentum, batch size 512, and often at higher resolution than pre-training. For higher resolution, patch size is kept constant, increasing the sequence length. Pre-trained position embeddings are interpolated to match the new 2D grid.
  • Metrics: Both fine-tuning accuracy and few-shot linear accuracy (solving a regularized least-squares regression on frozen features) are reported.
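The position-embedding interpolation used for higher-resolution fine-tuning can be sketched as follows. This is an illustrative NumPy version (the paper uses 2D interpolation; the exact interpolation kernel here is an assumption): embeddings for the old patch grid are reshaped to 2D and bilinearly resampled to the new grid. Grid sizes match fine-tuning 224px-pretrained ViT-L/16 at 384px (14×14 → 24×24 patches).

```python
import numpy as np

def interp1d(arr, new_len):
    # Linear interpolation along axis 0.
    old_len = arr.shape[0]
    coords = np.linspace(0, old_len - 1, new_len)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_len - 1)
    w = (coords - lo).reshape(-1, *([1] * (arr.ndim - 1)))
    return arr[lo] * (1 - w) + arr[hi] * w

def resize_pos_embed(pos, old_grid, new_grid):
    # Reshape (N, D) embeddings to their 2D grid, interpolate rows then
    # columns, and flatten back to a sequence ([class] token handled apart).
    D = pos.shape[-1]
    grid = pos.reshape(old_grid, old_grid, D)
    grid = interp1d(grid, new_grid)                     # rows
    grid = interp1d(grid.transpose(1, 0, 2), new_grid)  # columns
    return grid.transpose(1, 0, 2).reshape(new_grid * new_grid, D)

# 224px / patch 16 -> 14x14 grid; fine-tuning at 384px -> 24x24 grid.
pos = np.random.default_rng(0).standard_normal((14 * 14, 768))
new_pos = resize_pos_embed(pos, 14, 24)
print(new_pos.shape)  # (576, 768)
```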

Empirical Validation / Results

Comparison to State of the Art

ViT models pre-trained on JFT-300M were compared against Big Transfer (BiT) and Noisy Student models.

Table 2: Comparison with state of the art on popular image classification benchmarks.

| Model (Pre-training) | ImageNet | ImageNet ReaL | CIFAR-100 | VTAB (19 tasks) | TPUv3-core-days |
|---|---|---|---|---|---|
| ViT-H/14 (JFT) | 88.55 ± 0.04 | 90.72 ± 0.05 | 94.55 ± 0.04 | 77.63 ± 0.23 | 2.5k |
| ViT-L/16 (JFT) | 87.76 ± 0.03 | 90.54 ± 0.03 | 93.90 ± 0.05 | 76.28 ± 0.46 | 0.68k |
| ViT-L/16 (I21k) | 85.30 ± 0.02 | 88.62 ± 0.05 | 93.25 ± 0.05 | 72.72 ± 0.21 | 0.23k |
| BiT-L (ResNet152x4) | 87.54 ± 0.02 | 90.54 | 93.51 ± 0.08 | 76.29 ± 1.70 | 9.9k |
| Noisy Student (EffNet-L2) | 88.4 / 88.5* | 90.55 | – | – | 12.3k |

Key Takeaway: The largest ViT model achieves state-of-the-art or competitive results across all benchmarks while requiring an order of magnitude less compute for pre-training than the leading CNN-based models.

Pre-training Data Requirements

Experiments show the critical importance of dataset scale for ViT:

  • On small datasets (ImageNet-1k), ViT underperforms ResNets of comparable size due to a lack of inductive biases.
  • With medium datasets (ImageNet-21k), performance becomes comparable.
  • On very large datasets (JFT-300M), ViT outperforms ResNets. Figure 4 shows that while ResNets plateau with more data, ViT's performance continues to improve.

Scaling Study

A controlled study of performance versus pre-training compute (Figure 5) shows:

  1. Vision Transformers dominate the performance/compute trade-off, using 2–4× less compute than ResNets to attain the same performance.
  2. Hybrid models (CNN + ViT) slightly outperform pure ViT at small computational budgets, but the gap vanishes for larger models.
  3. Vision Transformer performance did not saturate within the range of models tried, suggesting strong potential for further scaling.

Model Inspection and Analysis

  • Filters: The initial linear projection learns filters that resemble plausible basis functions for representing patch structure (Figure 7, left).
  • Position Embeddings: Learned position embeddings encode 2D image topology, with closer patches having more similar embeddings (Figure 7, center).
  • Attention Distance: Analysis of "attention distance" (analogous to receptive field size) shows that some heads attend to most of the image even in lower layers, while others focus locally. Attention distance increases with network depth (Figure 7, right). In hybrid models, localized attention in early layers is less pronounced, suggesting the CNN backbone handles local feature processing.
  • Attention Maps: Visualization using Attention Rollout shows that the model attends to image regions that are semantically relevant for classification (Figure 6).
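The Attention Rollout method referenced above (Abnar & Zuidema, 2020) can be sketched compactly: per-layer attention matrices are averaged over heads, mixed with the identity to account for the residual connections, and multiplied across layers to trace how attention flows from the output back to the input tokens. The matrix shapes below (12 layers, 4 heads, 17 tokens) are illustrative, not the paper's configuration.

```python
import numpy as np

def attention_rollout(attn_per_layer):
    # attn_per_layer: list of (heads, N, N) row-stochastic attention matrices.
    N = attn_per_layer[0].shape[-1]
    rollout = np.eye(N)
    for attn in attn_per_layer:
        a = attn.mean(axis=0)               # average over heads
        a = 0.5 * a + 0.5 * np.eye(N)       # model the residual connection
        a /= a.sum(axis=-1, keepdims=True)  # renormalize rows
        rollout = a @ rollout
    return rollout

rng = np.random.default_rng(0)
# Random row-stochastic attention as stand-ins for real attention weights.
layers = [rng.dirichlet(np.ones(17), size=(4, 17)) for _ in range(12)]
rollout = attention_rollout(layers)
# Row 0 gives how much the [class] token ultimately attends to each patch.
cls_attention = rollout[0, 1:]
print(cls_attention.shape)  # (16,)
```

Reshaping `cls_attention` back to the patch grid yields the heatmaps shown in the paper's Figure 6.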

Theoretical and Practical Implications

  • Theoretical: Challenges the long-held assumption that convolutional inductive biases are essential for computer vision. Demonstrates that a general-purpose sequence modeling architecture (the Transformer) can achieve superior performance when scaled with sufficient data, emphasizing the power of learned representations over hand-designed architectural constraints.
  • Practical:
    • Efficiency: ViT provides a more compute-efficient path to state-of-the-art image recognition, significantly reducing the cost of large-scale model pre-training.
    • Unification: Offers a step towards architectural unification across vision and NLP, potentially simplifying model design and infrastructure.
    • Scalability: The clear scaling trends suggest that even larger ViT models trained on more data will yield further improvements.
    • Transfer Learning: ViT exhibits strong few-shot and transfer learning capabilities, especially on the diverse VTAB tasks, making it a powerful foundation model.

Conclusion

The Vision Transformer (ViT) demonstrates that a pure Transformer model applied directly to sequences of image patches can achieve state-of-the-art results on image classification when pre-trained on large-scale datasets. This simple and scalable approach requires substantially fewer computational resources for pre-training than leading convolutional networks. While the lack of inherent image biases means large amounts of data are needed for effective training, the results show that large-scale pre-training can overcome this limitation. The work opens up new directions for applying Transformers to other vision tasks (detection, segmentation) and for exploring self-supervised pre-training methods for ViT. The findings suggest that further scaling of both model and dataset size is a promising path forward.