TPU and accelerator Transformer era
Accelerators reward large, dense matrix multiplies and batched sequence processing. Transformers, BERT, and T5 are built almost entirely from such operations.
12 papers
Regime 4 of 10
Device/setup
Datacenter accelerators, especially TPUs and P100-class GPUs, with high-throughput matrix units and compiler/runtime sharding.
Bottleneck
Sequence length, attention memory, accelerator utilization, and partitioning very large dense models across pods and supercomputers.
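To make the sequence-length pressure concrete, the sketch below estimates the memory needed for one layer's attention score matrices under standard full self-attention, where each head holds a seq_len x seq_len matrix per example. The function name, the batch and head counts, and the 2-byte element size are illustrative assumptions, not figures from any of the papers listed here.

```python
def attention_score_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Bytes needed to hold one layer's attention score matrices.

    Assumes standard full self-attention: every head stores a
    seq_len x seq_len score matrix for every example in the batch.
    bytes_per_elem=2 assumes a 16-bit format (an illustrative choice).
    """
    return batch * heads * seq_len * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical batch and head counts, chosen only for illustration.
    for n in (512, 1024, 2048):
        mib = attention_score_bytes(batch=8, heads=16, seq_len=n) / 2**20
        print(f"seq_len={n:5d}  scores per layer = {mib:8.1f} MiB")
```

Doubling the sequence length quadruples this term, which is why sequence length and attention memory appear together as a bottleneck.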
Methods that fit
Transformers, BERT-style pretraining, TPU benchmarking, GPipe, Mesh-TensorFlow, T5, ViT, ALBERT, Transformer-XL, RoBERTa, XLNet, and attention analysis all reflect accelerator-native dense training.
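As one illustration of how these methods target accelerator utilization, the sketch below models the forward pass of GPipe-style micro-batch pipelining in a simplified form. It assumes every stage takes one time step per micro-batch and ignores the backward pass and communication cost; the function names are hypothetical and do not come from GPipe's code.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Return grid[t][k]: the micro-batch index stage k runs at step t, or None if idle.

    Simplified model: stage k can start micro-batch m only after stage k-1
    has finished it, so stage k runs micro-batch m at step m + k.
    """
    total_steps = num_microbatches + num_stages - 1
    grid = []
    for t in range(total_steps):
        row = []
        for k in range(num_stages):
            m = t - k
            row.append(m if 0 <= m < num_microbatches else None)
        grid.append(row)
    return grid


def idle_fraction(num_stages, num_microbatches):
    """Fraction of (step, stage) slots in which a stage has no work."""
    grid = pipeline_schedule(num_stages, num_microbatches)
    slots = len(grid) * num_stages
    idle = sum(cell is None for row in grid for cell in row)
    return idle / slots


if __name__ == "__main__":
    # Hypothetical configuration: 4 pipeline stages, varying micro-batch counts.
    for m in (1, 4, 32):
        print(f"micro-batches={m:2d}  idle fraction={idle_fraction(4, m):.3f}")
```

Under these assumptions the idle fraction works out to (K - 1) / (M + K - 1) for K stages and M micro-batches, so splitting a batch into more micro-batches keeps more of the accelerators busy.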
Methods that became obsolete or less central
RNN-heavy sequence modeling and manually partitioned one-off distributed code became less central for mainstream language and vision models.
Representative papers
Open questions
- Clarify how accelerator topology shaped the shift from task-specific networks to pretrain-and-transfer models.