TPU and accelerator Transformer era

Accelerators reward large, dense matrix multiplies and batched sequence processing. Transformers, and Transformer-based models such as BERT and T5, map directly onto this structure.

12 papers. Regime 4 of 10.

Device/setup

Datacenter accelerators, especially TPU- and P100-class systems, with high-throughput matrix units and compiler- or runtime-managed sharding.
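
As a rough illustration of what sharding a matrix multiply means, the sketch below splits a weight matrix column-wise across simulated devices, multiplies each shard independently, and concatenates the results. The "devices" are plain Python lists and every function name here is hypothetical; this is not how a TPU compiler or any particular framework implements partitioning, only a minimal picture of the idea.

```python
# Hypothetical illustration: the "devices" are just list slices in one process.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def split_columns(w, num_shards):
    """Split matrix w into num_shards equal-width column blocks."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("column count must divide evenly across shards")
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w]
            for s in range(num_shards)]

def sharded_matmul(x, w, num_shards):
    """Compute x @ w by giving each simulated device one column block of w."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Concatenate the per-device outputs along the column axis.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(sharded_matmul(x, w, num_shards=2))  # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

Each shard needs the full input but only a slice of the weights, which is why column partitioning is a natural fit when the weight matrix is the large operand.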

Bottleneck

Sequence length, attention memory, accelerator utilization, and partitioning very large dense models across pods and supercomputers.
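
To make the attention-memory bottleneck concrete, the following back-of-the-envelope sketch counts the bytes needed to hold the attention score matrices of a single layer when they are materialized in full. The batch size, head count, and 2-byte element width (a 16-bit format) are illustrative assumptions, not figures from any of the papers listed here.

```python
def attention_score_bytes(batch, heads, seq_len, bytes_per_element=2):
    """Bytes needed to hold one layer's full attention score matrices.

    Each head stores a seq_len x seq_len matrix per example, so the total
    grows quadratically in seq_len.
    """
    return batch * heads * seq_len * seq_len * bytes_per_element

# Illustrative (assumed) configuration: batch of 8, 16 heads, 16-bit scores.
for n in (512, 1024, 2048):
    mib = attention_score_bytes(batch=8, heads=16, seq_len=n) / 2**20
    print(f"seq_len={n}: {mib:.0f} MiB per layer")
# seq_len=512: 64 MiB per layer
# seq_len=1024: 256 MiB per layer
# seq_len=2048: 1024 MiB per layer
```

Doubling the sequence length quadruples this term, which is why sequence length and attention memory appear together as a bottleneck.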

Methods that fit

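Transformers, BERT-style pretraining (BERT, RoBERTa, ALBERT, XLNet), Transformer-XL, T5, and ViT all reflect accelerator-native dense training. GPipe and Mesh-TensorFlow supply the partitioning needed to scale such models, while TPU benchmarking and BERT attention analysis characterize the hardware and the trained models.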
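
The pipeline-parallel idea behind methods such as GPipe can be sketched with a toy schedule: a mini-batch is split into micro-batches that flow through the model's stages one step apart, so stages sit idle only while the pipeline fills and drains. The code below assumes an idealized forward-only schedule in which every stage takes the same time per micro-batch; the function names are hypothetical and this is a simplification, not GPipe's implementation.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of an idealized fill-and-drain pipeline schedule."""
    k, m = num_stages, num_microbatches
    return (k - 1) / (m + k - 1)

def forward_schedule(num_stages, num_microbatches):
    """For each time step, list the micro-batch each stage processes (None = idle)."""
    steps = num_microbatches + num_stages - 1
    return [[t - s if 0 <= t - s < num_microbatches else None
             for s in range(num_stages)]
            for t in range(steps)]

for row in forward_schedule(num_stages=2, num_microbatches=3):
    print(row)
# [0, None]
# [1, 0]
# [2, 1]
# [None, 2]

print(bubble_fraction(4, 1))             # 0.75
print(round(bubble_fraction(4, 16), 3))  # 0.158
```

Under these assumptions, using more micro-batches per mini-batch shrinks the idle fraction, at the cost of smaller per-step matrix multiplies.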

Methods that became obsolete or less central

RNN-heavy sequence modeling and manually partitioned one-off distributed code became less central for mainstream language and vision models.

Representative papers

Rank | Year | Paper | Priority | Status
33 | 2017 | Attention Is All You Need | 10 | downloaded / read_complete
34 | 2017 | In-Datacenter Performance Analysis of a Tensor Processing Unit | 5 | downloaded / read_complete
35 | 2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | 5 | downloaded / read_complete
36 | 2018 | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | 5 | downloaded / read_complete
37 | 2018 | Mesh-TensorFlow: Deep Learning for Supercomputers | 5 | downloaded / read_complete
38 | 2020 | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | 5 | downloaded / read_complete
39 | 2019 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | 4 | downloaded / read_complete
40 | 2019 | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | 3 | downloaded / read_complete
41 | 2019 | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 3 | downloaded / read_complete
42 | 2019 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | 2 | downloaded / read_complete
43 | 2019 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | 2 | downloaded / read_complete
44 | 2019 | What Does BERT Look at? An Analysis of BERT's Attention | 1 | downloaded / read_complete

Open questions

  • Clarify how accelerator topology shaped the shift from task-specific networks to pretrain-and-transfer models.

Papers in this compute regime: 12