Mesh-TensorFlow: Deep Learning for Supercomputers

Metadata

Slug: mesh_tensorflow_2018
Year: 2018
Venue: NeurIPS
Authors: Noam Shazeer et al.
Reading status: read complete
Compute regime: TPU and accelerator Transformer era
Primary sources: PDF, extracted text

Compute Setup

The paper explicitly targets TPU meshes up to 512 cores. It describes a 512-core TPU cluster with a 16x16x2 toroidal network interconnect, and notes that the same physical cluster can be represented as logical meshes such as 16x16x2, 32x16, or 512. The experiments use two-dimensional TPUv2 meshes up to 16x32, for 512 cores.

The largest reported language-model experiment trains a 4.9B-parameter Transformer for 13 hours on a 512-core TPUv2 cluster. The largest translation experiment trains a 2.9B-parameter model for 22 hours on a 128-core TPUv2 cluster. The paper also reports more than 50% computational efficiency on the largest models: 6 PFLOP/s out of a maximum 11.5 PFLOP/s.

Bottleneck

Pure data parallelism hits memory constraints and synchronization cost. Each processor holds a full copy of the parameters, and operations that reduce out the batch dimension require Allreduce over parameter-sized tensors. For very large Transformers, parameters and activations can exceed per-core memory, and the time needed to synchronize replicated parameters can make pure batch splitting inefficient.

The alternative, model parallelism, is necessary but hard to program manually. Transformer dimensions such as vocabulary, feed-forward hidden size, and attention heads can be split across devices, but the right choice depends on tensor shapes, batch size, memory, and physical network topology. The paper stresses that the logical mesh is only a naming abstraction: physical topology still affects performance, especially collective communication such as MPI Allreduce.

Method Adaptation

Mesh TensorFlow adapts TensorFlow-style models to accelerator meshes by:

Giving tensor dimensions names.
Giving processor-mesh dimensions names.
Mapping logical tensor dimensions to mesh dimensions.
Compiling SPMD programs with collectives such as Allreduce.
Splitting Transformer dimensions such as vocabulary, feed-forward width, and attention heads across mesh axes.

This turns tensor layout into an explicit part of the model definition. The critical abstraction is the mapping from named tensor dimensions to named mesh dimensions. Data parallelism is one layout, with batch mapped to the processor mesh. Model parallelism is another, with a hidden or vocabulary dimension mapped to the mesh. Hybrid parallelism uses a two-dimensional mesh, for example mapping batch to rows while mapping vocabulary, feed-forward width, and attention heads to columns.

That mapping is a compute-device adaptation because it lets the programmer express memory and communication tradeoffs without hand-writing a different program for each core. If an einsum reduces out a dimension that is split across the mesh, the compiler inserts collectives. If dimensions are split the same way across adjacent tensors, data movement can be avoided. If a tensor dimension is not mapped, it is replicated. The paper's examples show how communication volume changes depending on whether batch, hidden, input/output, or contracting dimensions are split.

For Transformers, the paper scales model dimensions and cluster dimensions together. Batch is split along one TPU mesh dimension, while vocabulary, feed-forward hidden dimension, and attention heads are split along the other. This layout maintains performance if batch size scales with the data-parallel axis and the split model dimensions scale with the model-parallel axis. The method is therefore a way to keep per-core memory and per-core arithmetic reasonable while growing both model width and the processor count.

Evidence

The scaling evidence is concrete. Using the two-dimensional TPUv2 layout, the authors train Transformer models with feed-forward hidden dimensions up to 262144 and up to 256 attention heads on meshes up to 16x32 cores. They report more than 50% efficiency, specifically 6 PFLOP/s out of 11.5 PFLOP/s, on the largest models.

For the One Billion Word language-model benchmark, the largest model has 4.9B parameters and trains for 10 epochs in 13 hours on 512 TPUv2 cores. The batch size for all models is 256 sequences of 256 tokens. The batch is split along the mesh dimension of size 16, and model dimensions are split along the mesh dimension of size 32. The largest model reports per-word development perplexity 24.0, dropping to 23.5 when logits are multiplied by 0.9 for evaluation.

For WMT14 English-French translation, the largest model has 2.9B parameters and trains for 3 epochs in 22 hours on 128 TPUv2 cores. It reaches BLEU 43.9 using sacrebleu, described by the paper as the best published result to date. Table 3 shows that model size grows as feed-forward width and attention heads grow: 0.15B parameters at d_ff 2048 and 4 heads reaches 41.8 En-Fr BLEU, while 2.89B parameters at d_ff 65536 and 128 heads reaches 43.9.

The system evidence is the cost table and examples. The paper compares layouts by computation, communication, and memory cost, and explains that if the per-processor batch is too small, performance suffers; if the hidden layer is sliced too finely, performance also suffers. Mesh TensorFlow's contribution is to expose these layout decisions rather than hide them inside fixed data-parallel training.

Historical Effect

Mesh TensorFlow made layout a first-class scaling abstraction. It is the systems bridge from single-machine Transformers to TPU-Pod language-model training: the model definition, tensor dimensions, processor mesh, and collective communication become one design surface. Later TPU-scale work such as T5, GShard, and Switch Transformer inherits the idea that model architecture and mesh layout must be co-designed.

Historically, this is an important step away from "just increase batch size" scaling. It shows that language-model quality can improve by increasing model width and attention heads when the layout keeps memory and communication under control. It also shifts responsibility from handwritten model-parallel code toward compiler-generated SPMD programs.

Limits

It assumes identical, reliable processors.
Tensor dimensions must divide cleanly by mesh dimensions.
Physical topology still affects collective communication.
Automated layout search is left as future work.
Bad layouts can replicate computation or communicate too much; the abstraction exposes layout but does not automatically solve every layout choice.
The results are tied to TPUv2 mesh collectives and TensorFlow/XLA-era infrastructure, so the exact efficiency claims should not be generalized to every accelerator topology.