An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Metadata

  • Slug: vit_2020
  • Year: 2020
  • Venue: ICLR
  • Authors: Alexey Dosovitskiy et al.
  • Reading status: read complete
  • Compute regime: TPU and accelerator Transformer era
  • Primary sources: PDF, extracted text

Compute Setup

The paper explicitly states that all models in the main state-of-the-art comparison were trained on TPUv3 hardware, and it reports pretraining cost in TPUv3-core-days. It defines that measure as the number of TPUv3 cores used for training (two per chip) multiplied by the training time in days. ViT-L/16 pretrained on ImageNet-21k is reported as trainable on a standard cloud TPUv3 with 8 cores in about 30 days.
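As a quick check of the definition above, the measure is simply cores times days. The sketch below uses a hypothetical helper name, `tpu_core_days`, and plugs in the 8-core, roughly 30-day figure quoted for ViT-L/16 on ImageNet-21k; it is an illustration of the arithmetic, not code from the paper.

```python
def tpu_core_days(num_cores: int, training_days: float) -> float:
    """TPUv3-core-days: cores used for training times training time in days."""
    return num_cores * training_days


# ViT-L/16 on ImageNet-21k: one 8-core cloud TPUv3 for about 30 days.
vit_l16_i21k = tpu_core_days(num_cores=8, training_days=30)
print(vit_l16_i21k)  # 240, i.e. about 0.24K core-days
```

The result of about 0.24K core-days is in line with the 0.23K TPUv3-core-days reported for this model in the Evidence section.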

The paper also gives concrete model and data scales. The three model variants are:

  • ViT-Base: 12 layers, hidden size 768, MLP size 3072, 12 heads, 86M parameters
  • ViT-Large: 24 layers, hidden size 1024, MLP size 4096, 16 heads, 307M parameters
  • ViT-Huge: 32 layers, hidden size 1280, MLP size 5120, 16 heads, 632M parameters

The pretraining datasets are ImageNet with 1.3M images, ImageNet-21k with 14M images, and JFT-300M with 303M high-resolution images. Pretraining uses Adam with batch size 4096, a high weight decay of 0.1, and linear learning-rate warmup and decay; fine-tuning uses SGD with momentum and batch size 512.
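The parameter totals follow mostly from the layer count, hidden size, and MLP size. The sketch below is a back-of-the-envelope estimate under stated assumptions that are not taken from the text above: a 224x224 RGB input, patch size 16 for Base and Large and 14 for Huge, bias terms on every linear layer, and no classification head. The function name `vit_param_estimate` is hypothetical. Under these assumptions the estimates should land within a percent or two of the reported figures.

```python
def vit_param_estimate(layers, hidden, mlp, patch, image=224, channels=3):
    """Approximate ViT parameter count, excluding the classification head."""
    # Self-attention: Q, K, V and output projections, each hidden x hidden plus bias.
    # The number of heads only splits these matrices, so it does not change the count.
    attention = 4 * (hidden * hidden + hidden)
    # MLP block: hidden -> mlp -> hidden, with biases.
    mlp_block = (hidden * mlp + mlp) + (mlp * hidden + hidden)
    # Two layer norms per block, each with a scale and a bias vector.
    norms = 2 * 2 * hidden
    encoder = layers * (attention + mlp_block + norms)

    # Patch projection, class token, position embeddings, final layer norm.
    num_patches = (image // patch) ** 2
    patch_proj = patch * patch * channels * hidden + hidden
    class_token = hidden
    pos_embed = (num_patches + 1) * hidden
    final_norm = 2 * hidden
    return encoder + patch_proj + class_token + pos_embed + final_norm


# (layers, hidden, mlp, assumed patch size, reported parameters)
CONFIGS = {
    "ViT-Base": (12, 768, 3072, 16, 86e6),
    "ViT-Large": (24, 1024, 4096, 16, 307e6),
    "ViT-Huge": (32, 1280, 5120, 14, 632e6),
}

for name, (layers, hidden, mlp, patch, reported) in CONFIGS.items():
    estimate = vit_param_estimate(layers, hidden, mlp, patch)
    print(f"{name}: estimated {estimate / 1e6:.1f}M, reported {reported / 1e6:.0f}M")
```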

Bottleneck

The compute bottleneck is how to turn image recognition into dense, accelerator-friendly Transformer computation without exploding the sequence length. A naive pixel-level self-attention model would require every pixel to attend to every other pixel. ViT instead uses a sequence of length N = HW / P^2, where H and W are the image height and width and P is the patch size. Smaller patches preserve more spatial detail, but N grows with the inverse square of P, and self-attention cost grows quadratically in N: halving P quadruples the sequence length and multiplies the attention-score computation by roughly 16.
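The relationship between patch size and sequence length can be tabulated directly. The following sketch assumes a 224x224 input, which is an illustrative choice rather than a figure from the text above, and uses a hypothetical helper `sequence_length`.

```python
def sequence_length(height: int, width: int, patch: int) -> int:
    """Number of patch tokens N = HW / P^2 (class token not counted)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


# Assumed 224x224 input; smaller patches give longer sequences.
for patch in (32, 16, 14):
    n = sequence_length(224, 224, patch)
    print(f"P={patch:2d}  N={n:4d}  attention pairs N^2={n * n}")
```

Going from P=32 to P=16 takes N from 49 to 196, a 4x increase, while the number of token pairs that attention must score grows 16x.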

Memory is the practical constraint during higher-resolution fine-tuning. The paper keeps the patch size fixed when input resolution rises, which lengthens the effective sequence; it states that ViT can handle arbitrary sequence lengths, up to memory constraints. Data is the other bottleneck. Because ViT has fewer built-in image priors than CNNs, ViT-Large underperforms ViT-Base when pretrained only on ImageNet. Only with larger pretraining datasets, especially JFT-300M, does the larger model show its benefit.
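To make the memory pressure concrete, the sketch below estimates the size of one layer's attention-score tensor for a single image as resolution rises at a fixed patch size. Everything numeric here is an assumption for illustration: the resolutions 224, 384, and 512, float32 scores, 16 heads, and a naive implementation that materializes the full score tensor. The helper name `attention_score_bytes` is hypothetical.

```python
def attention_score_bytes(resolution, patch=16, heads=16, bytes_per_value=4):
    """Bytes for one layer's attention-score tensor for a single square image."""
    tokens = (resolution // patch) ** 2 + 1  # patch tokens plus the class token
    return heads * tokens * tokens * bytes_per_value


base = attention_score_bytes(224)
for resolution in (224, 384, 512):
    size = attention_score_bytes(resolution)
    print(f"{resolution}px: {size / 2**20:.1f} MiB per layer per image "
          f"({size / base:.1f}x the 224px cost)")
```

Because the score tensor scales with the square of the token count, a little over doubling the side length raises this term by well over an order of magnitude, before multiplying by the number of layers and the batch size.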

Method Adaptation

ViT adapts vision to accelerator structure by turning an image into a regular token sequence. The image is split into fixed-size patches, each patch is flattened and linearly projected to the Transformer width, a learned class token is prepended, and learned 1D position embeddings are added. The encoder is otherwise a standard Transformer: multi-head self-attention, MLP blocks, layer normalization, and residual connections. This keeps the computation in large dense matrix multiplications, the workload TPUs handle well.
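The tokenization steps described above can be sketched in plain Python on a tiny image. This is a minimal illustration under assumptions, not the paper's implementation: images are nested lists of shape H x W x C, the projection has no bias, and the names `patchify`, `project`, and `embed` are hypothetical. A real model would use batched tensor operations with learned weights.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened P x P x C patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append all channel values
            patches.append(flat)
    return patches


def project(patches, weight):
    """Linear projection: each flattened patch (length K) times a K x D matrix."""
    columns = list(zip(*weight))  # D columns, each of length K
    return [[sum(x * w for x, w in zip(p, col)) for col in columns] for p in patches]


def embed(image, patch, weight, class_token, pos_embed):
    """Project patches, prepend the class token, then add position embeddings."""
    tokens = [class_token] + project(patchify(image, patch), weight)
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]


# Toy example: 4x4 single-channel image, 2x2 patches, model width D = 3.
image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
weight = [[1.0, 0.0, 0.5] for _ in range(4)]         # K = 2*2*1 = 4 rows
class_token = [0.0, 0.0, 0.0]
pos_embed = [[float(i)] * 3 for i in range(5)]       # 4 patches + 1 class token
tokens = embed(image, 2, weight, class_token, pos_embed)
print(len(tokens), len(tokens[0]))  # 5 3
```

The output is a regular sequence of N + 1 vectors of width D, which is the only input format the standard Transformer encoder needs.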

The architecture deliberately injects very little image-specific structure. Locality and translation equivariance are built into CNN layers, but ViT uses 2D structure mainly at patch extraction and during position-embedding interpolation for new resolutions. That choice is compute-relevant: it avoids specialized attention patterns whose implementations may be slower even if their theoretical FLOPs look attractive.
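The position-embedding interpolation mentioned above can be sketched as a resize of the square grid of patch embeddings. The version below assumes bilinear interpolation with aligned corners, which is one plausible choice and not a detail confirmed by the text above; it also assumes the class-token embedding is handled separately and carried over unchanged. The name `resize_position_grid` is hypothetical.

```python
def resize_position_grid(grid, new_size):
    """Bilinearly resize a square g x g grid of D-dim embeddings to new_size x new_size."""
    old_size = len(grid)
    dim = len(grid[0][0])

    def source_coord(index):
        # Map an output index to a fractional input coordinate (corners aligned).
        if new_size == 1:
            return 0.0
        return index * (old_size - 1) / (new_size - 1)

    resized = []
    for i in range(new_size):
        y = source_coord(i)
        y0 = int(y)
        y1 = min(y0 + 1, old_size - 1)
        wy = y - y0
        row = []
        for j in range(new_size):
            x = source_coord(j)
            x0 = int(x)
            x1 = min(x0 + 1, old_size - 1)
            wx = x - x0
            # Blend the four neighbouring embeddings, dimension by dimension.
            row.append([
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ])
        resized.append(row)
    return resized


# Toy example: a 2x2 grid of 1-dim embeddings resized to 3x3.
grid = [[[0.0], [1.0]], [[2.0], [3.0]]]
resized = resize_position_grid(grid, 3)
print(resized[1][1])  # [1.5], the average of the four corners
```

This is the one place after patch extraction where the 2D layout of the image is used explicitly; the encoder itself still sees a flat sequence.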

The paper tests this point with axial attention variants. Axial-ViT can improve accuracy/compute in some plots, but the open implementation is described as very slow on TPUs, and the authors say extensive large-scale experiments may require a carefully optimized implementation. This supports the card's main compute point: the useful method is not just lower asymptotic complexity, but alignment with the actual accelerator software stack.

Evidence

The headline comparison covers both accuracy and compute. ViT-H/14 pretrained on JFT-300M reaches 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63 on VTAB, using about 2.5K TPUv3-core-days. ViT-L/16 pretrained on JFT-300M reaches 87.76% on ImageNet and 76.28 on VTAB with 0.68K TPUv3-core-days. ViT-L/16 pretrained on the public ImageNet-21k dataset reaches 85.30% on ImageNet and 72.72 on VTAB with 0.23K TPUv3-core-days. The compared baselines cost more: BiT-L (ResNet152x4) uses 9.9K TPUv3-core-days, and Noisy Student (EfficientNet-L2) uses 12.3K.
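The compute gap is easier to read as ratios. The sketch below only divides the core-day figures quoted above; the dictionary name `CORE_DAYS_K` and the helper `compute_ratio` are hypothetical, and the ratios say nothing about accuracy differences between the models.

```python
# Pretraining cost in thousands of TPUv3-core-days, as quoted above.
CORE_DAYS_K = {
    "ViT-H/14 (JFT-300M)": 2.5,
    "ViT-L/16 (JFT-300M)": 0.68,
    "ViT-L/16 (ImageNet-21k)": 0.23,
    "BiT-L (ResNet152x4)": 9.9,
    "Noisy Student (EfficientNet-L2)": 12.3,
}


def compute_ratio(baseline: str, model: str) -> float:
    """How many times more pretraining compute the baseline used than the model."""
    return CORE_DAYS_K[baseline] / CORE_DAYS_K[model]


for baseline in ("BiT-L (ResNet152x4)", "Noisy Student (EfficientNet-L2)"):
    for model in ("ViT-H/14 (JFT-300M)", "ViT-L/16 (JFT-300M)"):
        print(f"{baseline} vs {model}: {compute_ratio(baseline, model):.1f}x")
```

On these figures, BiT-L used roughly 4x the pretraining compute of ViT-H/14 and roughly 15x that of ViT-L/16 on JFT-300M.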

The controlled compute study states that Vision Transformers dominate ResNets on the performance/pretraining-compute tradeoff, using approximately 2-4x less compute to reach the same average performance over five datasets. The appendix timing section checks real hardware speed on TPUv3 because FLOPs do not always predict wall-clock speed. It reports that ViT models have speed comparable to similar ResNets and that large ViT models have a clear memory-efficiency advantage, fitting larger per-core batches across input sizes.

Historical Effect

ViT moved vision into the Transformer accelerator regime. It showed that, with enough data and TPU-scale pretraining, a mostly vanilla dense Transformer could compete with or beat highly engineered CNN baselines. The historical effect is not only "attention works for images"; it is that image models could be expressed in the same matrix-multiply-heavy, large-batch, large-dataset compute language as NLP Transformers.

Limits

ViT depends heavily on pretraining data scale. On ImageNet-only pretraining, larger ViTs can underperform smaller ones despite regularization. Smaller patch sizes increase sequence length and compute. Higher-resolution fine-tuning is bounded by memory. Efficiency comparisons are affected by optimizer, schedule, weight decay, and implementation quality. Specialized attention may offer accuracy/compute benefits, but the paper's TPU experiments show that unoptimized variants can be too slow for large-scale use.

Links