TPU, Accelerators, and the Transformer Era

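Accelerators are naturally suited to large-scale matrix multiplication and batched sequence processing. Transformer, BERT, and T5 fit this hardware structure.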

12 papers. Compute paradigm 4 of 10.

TPU and accelerator Transformer era

English source file: README.md

Device/Setup

Datacenter accelerators, especially TPU/P100-class systems, that offer high-throughput matrix units and compiler/runtime sharding.
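
To make the idea of sharding concrete, here is a minimal sketch that splits a weight matrix column-wise across simulated devices and reassembles the result. It is pure Python with no real accelerator or compiler API. The function names (`matmul`, `shard_columns`, `sharded_matmul`) and the tiny matrices are hypothetical, chosen only for illustration.

```python
# Illustrative only: "devices" are simulated as list entries, not real hardware.

def matmul(a, b):
    """Plain dense matrix product of two lists-of-lists."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def shard_columns(w, num_devices):
    """Split the columns of w into num_devices contiguous, equal-width shards."""
    cols = len(w[0])
    if cols % num_devices != 0:
        raise ValueError("column count must divide evenly across devices")
    width = cols // num_devices
    return [[row[d * width:(d + 1) * width] for row in w]
            for d in range(num_devices)]


def sharded_matmul(x, w, num_devices):
    """Compute x @ w by giving each simulated device one column shard of w."""
    partials = [matmul(x, shard) for shard in shard_columns(w, num_devices)]
    # Concatenate the per-device outputs along the column axis.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
# The sharded result matches the unsharded product.
assert sharded_matmul(x, w, num_devices=2) == matmul(x, w)
```

In this sketch each device needs only its own slice of the weights and the full input, so no partial sums have to be exchanged. Real systems choose among several partitioning layouts, and that choice is outside the scope of this example.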

Bottleneck

Sequence length, attention memory, accelerator utilization, and partitioning very large dense models across pods/supercomputers.
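
The link between sequence length and attention memory can be shown with a rough back-of-the-envelope calculation. The sketch below counts only the attention score matrix of a single layer, which has one entry per pair of positions, per head, per example. The batch size, head count, and 2-byte element size are illustrative assumptions, not figures from any of the papers listed here.

```python
def attention_score_bytes(batch, heads, seq_len, bytes_per_element=2):
    """Bytes needed to hold one layer's (seq_len x seq_len) attention scores.

    Assumes standard dense self-attention and ignores activations, weights,
    and optimizer state, so it is a lower bound on real memory use.
    """
    return batch * heads * seq_len * seq_len * bytes_per_element


# Assumed example values: batch of 8, 12 heads, 16-bit elements.
for seq_len in (512, 1024, 2048):
    mib = attention_score_bytes(batch=8, heads=12, seq_len=seq_len) / 2**20
    print(f"seq_len={seq_len}: {mib:.0f} MiB per layer")
# seq_len=512: 48 MiB per layer
# seq_len=1024: 192 MiB per layer
# seq_len=2048: 768 MiB per layer
```

Doubling the sequence length quadruples this term, which is why sequence length and attention memory appear together as bottlenecks.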

Methods that fit

Transformer, BERT pre-training, TPU benchmarking, GPipe, Mesh-TensorFlow, T5, ViT, ALBERT, Transformer-XL, RoBERTa, XLNet, and attention analysis exemplify accelerator-native dense training.
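
As one illustration of how accelerator utilization enters these methods, consider pipeline parallelism in the style of GPipe. With K pipeline stages and M micro-batches, and under the idealized assumption that every stage takes the same time per micro-batch, the fraction of the schedule spent idle (the "bubble") is (K - 1) / (M + K - 1). The sketch below evaluates that expression. The function name and the example values of K and M are hypothetical.

```python
from fractions import Fraction


def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of an idealized pipeline schedule with equal-time stages.

    The schedule runs for (M + K - 1) time slots; each stage is busy for M
    of them and idle for the remaining (K - 1).
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("need at least one stage and one micro-batch")
    return Fraction(num_stages - 1, num_microbatches + num_stages - 1)


# Assumed example: 4 stages, increasing the number of micro-batches.
for m in (4, 16, 32):
    bubble = pipeline_bubble_fraction(num_stages=4, num_microbatches=m)
    print(f"K=4, M={m}: bubble = {bubble} ({float(bubble):.2f})")
```

The idle fraction shrinks as the number of micro-batches grows relative to the number of stages, which is the motivation for splitting each mini-batch into many micro-batches.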

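Methods that became obsolete or less central

RNN-heavy sequence modeling and one-off hand-written distributed code are no longer central to mainstream language/vision models.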

Representative papers

Rank | Year | Paper | Priority | Status
33 | 2017 | Attention Is All You Need | 10 | downloaded / read_complete
34 | 2017 | In-Datacenter Performance Analysis of a Tensor Processing Unit | 5 | downloaded / read_complete
35 | 2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | 5 | downloaded / read_complete
36 | 2018 | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | 5 | downloaded / read_complete
37 | 2018 | Mesh-TensorFlow: Deep Learning for Supercomputers | 5 | downloaded / read_complete
38 | 2020 | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | 5 | downloaded / read_complete
39 | 2019 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | 4 | downloaded / read_complete
40 | 2019 | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | 3 | downloaded / read_complete
41 | 2019 | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 3 | downloaded / read_complete
42 | 2019 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | 2 | downloaded / read_complete
43 | 2019 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | 2 | downloaded / read_complete
44 | 2019 | What Does BERT Look at? An Analysis of BERT's Attention | 1 | downloaded / read_complete

Open questions

  • Clarify how accelerator topology drove the shift from task-specific networks to the pretrain-then-transfer paradigm.

Related papers: 12