Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Metadata

Slug: gnmt_2016
Year: 2016
Venue: arXiv
Authors: Yonghui Wu et al.
Reading status: read complete
Compute regime: Multi-GPU dense training
Primary sources: PDF, extracted text

Compute Setup

The paper explicitly lists both training and inference devices. GNMT trains in TensorFlow with data parallelism across replicas and model parallelism inside each replica. A model replica is partitioned 8 ways across 8 GPUs, typically on one host machine. The architecture in that setup has 8 encoder LSTM layers, with the bottom encoder layer bidirectional and the next 7 unidirectional, plus 8 decoder LSTM layers. The paper also states that the softmax layer is partitioned across multiple GPUs, either sharing the encoder/decoder GPUs or using dedicated GPUs depending on vocabulary size.

At the training-system level, the paper describes a classic data-parallel setup: 12 replicas run concurrently on separate machines and update shared parameters asynchronously. Its general description of the training scheme gives n as often around 10 replicas, with each replica working on a minibatch of 128 sentence pairs. For WMT'14 English-to-French, the concrete scale is around 6 days to train a basic model using 96 NVIDIA K80 GPUs, which matches 12 replicas of 8 GPUs each.
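
As a quick consistency check on these figures, the arithmetic can be written out. This is only a back-of-the-envelope sketch; the variable names are illustrative, and the "pairs in flight" number is an upper bound under the assumption that every replica is busy at once, not a synchronous global batch.

```python
# Figures quoted in the notes above.
REPLICAS = 12          # data-parallel replicas on separate machines
GPUS_PER_REPLICA = 8   # each replica is partitioned 8 ways
MINIBATCH = 128        # sentence pairs per replica step
TRAIN_DAYS = 6         # approximate, basic WMT'14 En->Fr model

total_gpus = REPLICAS * GPUS_PER_REPLICA
gpu_days = total_gpus * TRAIN_DAYS
# Updates are asynchronous, so this is an upper bound on concurrent work,
# not the size of a synchronized batch.
pairs_in_flight = REPLICAS * MINIBATCH

print(total_gpus, gpu_days, pairs_in_flight)
```

The GPU count reproduces the 96 K80s reported for the WMT run, which puts the basic model at roughly 576 K80 GPU-days.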

Inference is described on a single machine with two Intel Haswell CPUs, totaling 88 hyperthreaded CPU cores. The same machine is equipped either with an NVIDIA Tesla K80 GPU or with a single Google TPU. CPU and GPU decoding use full-precision floats. TPU decoding offloads quantized operations to the TPU, while embedding lookup and the attention module remain on the CPU.

Bottleneck

GNMT sits at the point where recurrent NMT was accurate enough to be production-relevant but expensive enough that training and serving architecture mattered. The training bottleneck is deep recurrent computation plus a large output softmax. LSTM time steps are sequential, so the only practical parallelism is across layers, examples, replicas, and vocabulary partitions. Fully bidirectional encoders would improve context but would reduce layer pipeline parallelism, because later layers would wait for both directions to finish.

The inference bottleneck is different. Beam search performs many small sequential decoding steps and requires host/device coordination. The paper reports that CPU decoding is faster than GPU decoding in their setup, partly because the dual-Haswell host offers a theoretical peak FLOP rate more than two thirds that of the GPU, and partly because beam search forces nontrivial host-GPU data transfer at every decoding step. This is a device-utilization problem: the GPU has the theoretical arithmetic capacity, but the decoding control flow and transfer pattern prevent the implementation from using it well.
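
The step-by-step dependency can be seen in a toy beam search. This is a minimal sketch, not GNMT's decoder: toy_step is a hypothetical stand-in for one decoder step, the four-token vocabulary is invented, and refinements the paper uses, such as length normalization and a coverage penalty, are left out.

```python
import math

VOCAB = ["<eos>", "a", "b", "c"]  # invented toy vocabulary

def toy_step(prefix):
    """Hypothetical stand-in for one decoder step: log-probs over VOCAB."""
    t = len(prefix)
    logits = [t - 1.0, 1.0, 0.5, 0.0]  # <eos> becomes more likely over time
    log_z = math.log(sum(math.exp(x) for x in logits))
    return [x - log_z for x in logits]

def beam_search(beam_size=3, max_len=5):
    beams = [((), 0.0)]  # (tokens so far, cumulative log-prob)
    finished = []
    steps = 0
    for _ in range(max_len):
        steps += 1
        candidates = []
        # "Device" side: one model call per live hypothesis.
        for prefix, score in beams:
            for tok, lp in zip(VOCAB, toy_step(prefix)):
                candidates.append((prefix + (tok,), score + lp))
        # "Host" side: rank and prune before the next step can start.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            if cand[0][-1] == "<eos>":
                finished.append(cand)
            else:
                beams.append(cand)
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses as fallbacks
    return max(finished, key=lambda c: c[1]), steps

best, steps = beam_search()
print(best, steps)
```

Each iteration needs the pruned beam from the previous one, so the per-step model calls cannot be batched across time. If scoring runs on an accelerator and pruning on the host, every step pays a round trip.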

Method Adaptation

GNMT adapts the model to the devices in several ways. Residual connections begin at the third layer from the bottom in both encoder and decoder, making the 8-layer stacks trainable. The encoder and decoder are partitioned along depth so different layers run on different GPUs; because most encoder layers are unidirectional, layer i+1 can start before layer i finishes the full sentence. Attention is intentionally connected from the bottom decoder-layer output to the top encoder output. The authors state that aligning the top decoder layer to the top encoder layer would remove decoder parallelism and make more than one GPU unhelpful for decoding.
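
The effect of depth-wise partitioning can be illustrated with a small timing model. It is a sketch under stated assumptions rather than a measurement: every LSTM cell costs one time unit, communication is free, each layer has its own device, and the two directions of a bidirectional layer run concurrently. The function names are illustrative.

```python
def layer_outputs(avail, bidirectional):
    """Times at which one layer's output for each position becomes available.

    avail[t] is the time at which the input for position t is ready.
    """
    T = len(avail)
    fwd, clock = [0] * T, 0
    for t in range(T):
        clock = max(clock, avail[t]) + 1  # wait for input, then one cell
        fwd[t] = clock
    if not bidirectional:
        return fwd
    bwd, clock = [0] * T, 0
    for t in reversed(range(T)):
        clock = max(clock, avail[t]) + 1
        bwd[t] = clock
    # A bidirectional output needs both directions at that position.
    return [max(f, b) for f, b in zip(fwd, bwd)]

def encoder_finish_time(seq_len, layer_is_bidirectional):
    """Time at which the top layer has produced its last output."""
    avail = [0] * seq_len  # the whole input sentence is ready at time 0
    for bidir in layer_is_bidirectional:
        avail = layer_outputs(avail, bidir)
    return max(avail)

T = 20
print(encoder_finish_time(T, [True] + [False] * 7))  # one bidirectional layer at the bottom
print(encoder_finish_time(T, [False] * 8))           # all unidirectional
print(encoder_finish_time(T, [True] * 8))            # all bidirectional
```

Under these assumptions a 20-token sentence finishes in 46 units with one bidirectional bottom layer, 27 units with an all-unidirectional stack, and 160 units with a fully bidirectional stack. The last figure equals 8 layers times 20 steps, so the pipeline overlap across layers is lost entirely.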

The output softmax is vocabulary-partitioned, which addresses the same large-softmax pressure already visible in earlier seq2seq work. Wordpieces are another compute adaptation. A 32K shared wordpiece vocabulary keeps sequences much shorter than character models while avoiding the rare-word failure modes of word models. The paper explicitly says 8K-32K wordpieces balance accuracy and fast decoding, and Table 4 shows WPM-32K with the best En->Fr BLEU among the single models.
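
One way a vocabulary-partitioned softmax can work is sketched below. The paper, as summarized here, says only that the softmax is partitioned; the contiguous sharding and the max/sum reduction are assumptions made for illustration, and the shards are plain Python lists rather than per-device tensors.

```python
import math

def sharded_log_softmax(logits, num_shards):
    """Log-softmax computed shard by shard, as if each shard had its own device."""
    shard_size = math.ceil(len(logits) / num_shards)
    shards = [logits[i:i + shard_size] for i in range(0, len(logits), shard_size)]
    # Each "device" reduces its shard to a local max and a local sum of exponentials.
    local = []
    for shard in shards:
        m = max(shard)
        local.append((m, sum(math.exp(x - m) for x in shard)))
    # Only two scalars per shard need to be exchanged to get the global normalizer.
    global_max = max(m for m, _ in local)
    log_z = global_max + math.log(
        sum(s * math.exp(m - global_max) for m, s in local)
    )
    return [x - log_z for shard in shards for x in shard]

def full_log_softmax(logits):
    """Unsharded reference implementation."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

logits = [0.1 * i - 3.0 for i in range(50)]
sharded = sharded_log_softmax(logits, num_shards=4)
reference = full_log_softmax(logits)
print(max(abs(a - b) for a, b in zip(sharded, reference)))
```

The sharded result matches the unsharded one up to floating-point error, while each shard only ever touches its own slice of the vocabulary.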

For serving, quantization-aware training makes TPU inference viable. The production split is not "all TPU"; it is CPU plus TPU, with attention and embeddings still on CPU and the quantized dense recurrent work offloaded.
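
To make the idea of quantized weights concrete, here is a generic per-row symmetric 8-bit rounding scheme. It is a sketch of the general technique and does not reproduce the paper's training-time constraints or its TPU kernels; the function names and the example matrix are invented.

```python
def quantize_rows(weights):
    """Per-row symmetric quantization to integers in [-127, 127]."""
    q_rows, scales = [], []
    for row in weights:
        # Fall back to 1.0 so an all-zero row does not divide by zero.
        scale = max(abs(w) for w in row) or 1.0
        q_rows.append([round(w / scale * 127) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_rows(q_rows, scales):
    """Map the integers back to approximate float weights."""
    return [[q * s / 127 for q in row] for row, s in zip(q_rows, scales)]

weights = [[0.5, -0.25, 0.1], [-1.0, 0.3, 0.0]]
q_rows, scales = quantize_rows(weights)
restored = dequantize_rows(q_rows, scales)
print(q_rows)
print(restored)
```

The rounding error per weight is at most half a quantization step, that is scale / 254 for its row. Quantization-aware training exists to make the model tolerate errors of this kind before it is deployed.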

Evidence

The inference table is the clearest compute result. On the WMT En->Fr development set, CPU decoding takes 1322 seconds with BLEU 31.20 and log perplexity 1.4553. GPU decoding takes 3028 seconds with the same BLEU and log perplexity, about 2.3x slower than CPU, showing that the K80 implementation underuses GPU capacity for this beam-search workload. TPU decoding takes 384 seconds, with BLEU 31.21 and log perplexity 1.4626: a 3.4x speedup over CPU with no BLEU loss and only a slightly higher log perplexity.
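
The ratios follow directly from the decoding times quoted above. The short script below just does that division; the dictionary name is illustrative.

```python
# Decoding times in seconds, as quoted from the inference table.
decode_seconds = {"CPU": 1322, "GPU": 3028, "TPU": 384}

cpu = decode_seconds["CPU"]
speedup_vs_cpu = {device: cpu / secs for device, secs in decode_seconds.items()}

for device, ratio in speedup_vs_cpu.items():
    print(f"{device}: {decode_seconds[device]} s, {ratio:.2f}x relative to CPU")
```

A ratio below 1 means slower than CPU, which is the case for the GPU row.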

The training procedure also exposes the compute regime. The system clips gradients to norm 5.0, starts with Adam for 60K steps, switches to SGD, uses minibatches of 128, anneals after 1.2M steps on En->Fr, and takes about 6 days on 96 K80s for the basic WMT model.
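
The clipping and optimizer switch can be sketched in a few lines. The clip norm, the 60K-step switch, and the 1.2M-step annealing start come from the figures above. Clipping by the global norm of all gradients, the base learning rate, and the halving interval are assumptions made only so the example runs.

```python
import math

CLIP_NORM = 5.0
ADAM_STEPS = 60_000
ANNEAL_START = 1_200_000

def clip_by_global_norm(grads, clip_norm=CLIP_NORM):
    """Rescale a flat list of gradients so their L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clip_norm:
        return list(grads)
    scale = clip_norm / norm
    return [g * scale for g in grads]

def optimizer_for_step(step):
    """Adam for the first ADAM_STEPS steps, plain SGD afterwards."""
    return "adam" if step < ADAM_STEPS else "sgd"

def sgd_learning_rate(step, base_lr=0.5, halve_every=200_000):
    """Step-wise annealing; base_lr and halve_every are illustrative assumptions."""
    if step < ANNEAL_START:
        return base_lr
    halvings = 1 + (step - ANNEAL_START) // halve_every
    return base_lr * 0.5 ** halvings

print(clip_by_global_norm([6.0, 8.0]))  # norm 10 is rescaled to norm 5
print(optimizer_for_step(59_999), optimizer_for_step(60_000))
print(sgd_learning_rate(1_000_000), sgd_learning_rate(1_200_000))
```

Clipping bounds the size of each update, which matters more when many replicas apply gradients asynchronously to the same parameters.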

Translation evidence is tied to the design choices. On WMT En->Fr, WPM-32K reaches 38.95 BLEU averaged over 8 trained models, with a maximum of 39.37, and a CPU decoding time of 0.2118 seconds per sentence (Table 4). On WMT En->De, WPM-32K reaches 24.61 BLEU and has the fastest listed CPU decoding time among the wordpiece options. In production human side-by-side tests across major language pairs, the paper reports a reduction of more than 60% in translation errors versus Google's phrase-based production system.

Historical Effect

GNMT is an accelerator-shaped bridge between early single-machine seq2seq and later fully parallel Transformers. It shows recurrent models already needing model parallelism, data parallelism, softmax sharding, quantization, and TPU-oriented serving before attention-only architectures became dominant. Its compute story is also a reminder that training and inference can have opposite hardware bottlenecks.

Limits

The model remains recurrent, so token-level decoding and much of training are sequential. The K80 training setup for the WMT experiments is specified, but many details of the production system remain internal. Serving is only partly offloaded to the TPU: embedding lookup and the attention module remain on the CPU in the described comparison. The paper also notes that reinforcement learning improves BLEU more than human evaluation, so some extra training compute optimizes the metric more than perceived translation quality.