Sequence to Sequence Learning with Neural Networks

Metadata

Slug: seq2seq_2014
Year: 2014
Venue: NeurIPS
Authors: Ilya Sutskever, Oriol Vinyals, Quoc V. Le
Reading status: read complete
Compute regime: Single-GPU deep learning
Primary sources: PDF, extracted text

Compute Setup

The paper gives a clear implementation-scale description but does not name the exact GPU model. It first reports that a C++ deep-LSTM implementation with the paper's configuration processes about 1,700 words per second on a single GPU, which the authors judged too slow. They therefore parallelized on an 8-GPU machine. Four GPUs hold the four LSTM layers, one layer per GPU, and each passes its activations to the next layer as soon as they are computed. The remaining four GPUs parallelize the softmax, with each GPU multiplying by a 1000 x 20000 matrix, that is, one quarter of the 80,000-word output vocabulary.
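The softmax split can be illustrated with a minimal pure-Python sketch. It is not the authors' C++ code: the helper names (logits_for, sharded_softmax) and the toy sizes are assumptions for illustration, and each "shard" merely stands in for the slice of the output matrix held by one GPU.

```python
import math
import random

def logits_for(hidden, columns):
    """One logit per vocabulary word: dot product of hidden with each weight column."""
    return [sum(h * w for h, w in zip(hidden, col)) for col in columns]

def softmax(logits):
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sharded_softmax(hidden, columns, num_shards):
    """Compute logits shard by shard, then normalize over the whole vocabulary.

    Assumes the vocabulary size is divisible by num_shards.
    """
    shard_size = len(columns) // num_shards
    logits = []
    for s in range(num_shards):
        # Each shard stands in for the slice of the output matrix on one GPU.
        shard = columns[s * shard_size:(s + 1) * shard_size]
        logits.extend(logits_for(hidden, shard))
    # Normalization still needs a reduction across all shards.
    return softmax(logits)

# Paper-scale arithmetic: 80,000 output words over 4 GPUs.
columns_per_gpu = 80_000 // 4

# Toy sizes (hypothetical) so the sketch runs instantly on a CPU.
random.seed(0)
hidden_size, vocab_size = 16, 40
hidden = [random.gauss(0, 1) for _ in range(hidden_size)]
columns = [[random.gauss(0, 1) for _ in range(hidden_size)] for _ in range(vocab_size)]
probs = sharded_softmax(hidden, columns, num_shards=4)
```

The matrix multiply splits cleanly by vocabulary column, but the normalizer does not: the shards must still exchange their maxima and partial sums, so the split is not communication-free.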

This 8-GPU implementation reaches about 6,300 words per second (English and French combined) with minibatch size 128, and training takes about 10 days. Because the paper says only "single GPU" and "8-GPU machine" and never names the device, the exact chip is not source-supported. Under the project rule, the device era is a 2014 pre-TPU deep-learning GPU workstation regime, but the card should not claim a specific NVIDIA model.
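As a rough consistency check on these figures, the reported throughput can be combined with the dataset size and epoch count given elsewhere in this card. The sketch assumes constant throughput for the whole run and that the 6,300 words per second counts source and target words together. The speedup line further assumes the 1,700 figure is measured the same way, which the paper does not state.

```python
SECONDS_PER_DAY = 86_400

words_per_epoch = 348e6 + 304e6   # French + English words in the training set
epochs = 7.5
words_per_second = 6_300          # reported 8-GPU throughput, both languages

train_seconds = epochs * words_per_epoch / words_per_second
train_days = train_seconds / SECONDS_PER_DAY

# Assumption: both throughput figures count words the same way.
speedup = 6_300 / 1_700

print(f"estimated training time: {train_days:.1f} days")
print(f"throughput gain from 8 GPUs: {speedup:.1f}x")
```

The estimate comes out at roughly 9 days, in line with the reported "about 10 days". The throughput gain is about 3.7x on 8 GPUs, which suggests the layer pipeline and softmax split were far from linear scaling.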

Bottleneck

The bottleneck is the combination of sequential recurrent computation, deep stacked state, and a large target softmax. The model uses 4 LSTM layers, 1000 cells per layer, 1000-dimensional word embeddings, a 160,000-word source vocabulary, and an 80,000-word target vocabulary. The resulting LSTM has 384M parameters, of which 64M are recurrent connections. At every output step, the naive softmax ranges over 80,000 words, so the output layer is large enough to consume half of the 8-GPU machine.
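One accounting that reproduces both parameter figures is sketched below. The breakdown is a reconstruction, not a table from the paper. It assumes biases are ignored, every LSTM layer has input width equal to the cell count, and the decoder has its own input embedding table separate from the softmax weights.

```python
layers, cells, embed_dim = 4, 1000, 1000
src_vocab, tgt_vocab = 160_000, 80_000

# One LSTM layer: 4 gates, each with input-to-hidden and hidden-to-hidden weights.
# Assumption: biases are ignored and each layer's input width equals embed_dim == cells.
per_layer = 4 * cells * (embed_dim + cells)
lstm_weights = 2 * layers * per_layer          # encoder stack + decoder stack

src_embeddings = src_vocab * embed_dim         # encoder input embeddings
tgt_embeddings = tgt_vocab * embed_dim         # decoder input embeddings (assumed separate)
softmax_weights = cells * tgt_vocab            # output projection

total = lstm_weights + src_embeddings + tgt_embeddings + softmax_weights

print(f"LSTM weights: {lstm_weights / 1e6:.0f}M")
print(f"total:        {total / 1e6:.0f}M")
```

Under these assumptions the vocabulary-sized tables (160M + 80M + 80M) account for 320M of the 384M parameters, which fits the card's point that the vocabulary, not the recurrent core, dominates model size.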

Variable-length sentences create a second compute bottleneck. If a minibatch of 128 randomly mixes short and long sentences, the long examples force padding and much of the recurrent work is wasted. Decoding also has a serial control-flow cost: beam search expands partial hypotheses one token at a time, though the paper finds beam size 2 provides most of the benefit for the ensemble.
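The serial nature of decoding shows up in a minimal left-to-right beam search sketch. The toy_model function is a hypothetical stand-in for the LSTM decoder, and its probabilities are invented for illustration; only the search loop reflects the procedure described in the paper.

```python
import math

EOS = "<EOS>"

def beam_search(next_log_probs, beam_size, max_len):
    """Left-to-right beam search; returns (tokens, log_prob) of the best hypothesis."""
    beam = [((), 0.0)]          # partial hypotheses: (token tuple, log-probability)
    complete = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beam:
            # Serial dependency: each extension needs the prefix chosen so far.
            for word, logp in next_log_probs(tokens).items():
                candidates.append((tokens + (word,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for tokens, score in candidates[:beam_size]:
            # Finished hypotheses leave the beam and join the complete set.
            (complete if tokens[-1] == EOS else beam).append((tokens, score))
        if not beam:
            break
    # Simplification: unfinished hypotheses compete if max_len is reached.
    return max(complete + beam, key=lambda c: c[1])

def toy_model(prefix):
    """Hypothetical next-word distribution standing in for the LSTM decoder."""
    if len(prefix) == 0:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"x": math.log(0.55), "y": math.log(0.45)}
    if prefix[-1] == "b":
        return {"z": math.log(0.9), EOS: math.log(0.1)}
    return {EOS: 0.0}           # after x, y, or z the toy model always stops

greedy_tokens, greedy_score = beam_search(toy_model, beam_size=1, max_len=5)
beam2_tokens, beam2_score = beam_search(toy_model, beam_size=2, max_len=5)
```

In this toy setup greedy decoding commits to "a" and ends with probability 0.33, while beam size 2 keeps "b" alive and finds the 0.36 hypothesis. Either way, the outer loop over output positions cannot be parallelized, because each step consumes the previous step's choices.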

Method Adaptation

The method adapts to this GPU regime through manual model partitioning and data-shape control. The four LSTM layers are placed across four GPUs, exposing layer parallelism even though each sequence remains time-serial. The softmax is split across four more GPUs, directly addressing the 80,000-class output matrix. Sentence bucketing is a compute optimization: the authors ensure all sentences in a minibatch are roughly the same length and report a 2x speedup by avoiding wasted padding computation.
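The effect of length bucketing on padding can be sketched as follows. The sentence-length distribution is hypothetical (the paper does not publish one), and sorting by length is the simplest stand-in for "roughly the same length" rather than the authors' exact batching procedure.

```python
import random

def padded_steps(lengths, batch_size):
    """Total timesteps computed when each batch is padded to its longest sentence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

random.seed(0)
# Hypothetical sentence lengths, uniform between 5 and 60 tokens.
lengths = [random.randint(5, 60) for _ in range(128 * 50)]

useful = sum(lengths)                                        # steps on real tokens
random_order = padded_steps(lengths, batch_size=128)         # randomly mixed batches
bucketed = padded_steps(sorted(lengths), batch_size=128)     # similar-length batches

print(f"wasted fraction, random batches:   {1 - useful / random_order:.2f}")
print(f"wasted fraction, bucketed batches: {1 - useful / bucketed:.2f}")
```

The size of the saving depends entirely on the length distribution, so this sketch should not be read as reproducing the reported 2x figure. In practice the bucketed batches would also be shuffled so that training does not see sentences in length order.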

The most famous modeling trick, reversing the source sentence, is also compute-structural. It does not change hardware, but it changes the optimization path through time. The paper argues that reversal creates short-term dependencies between early source words and early target words: the average distance between corresponding words is unchanged, but the minimal time lag that backpropagation must bridge is greatly reduced. The same recurrent training budget therefore buys a better model. Gradient clipping is another adaptation to the recurrent compute path: the batch gradient is divided by the batch size of 128, and if its L2 norm exceeds 5 it is rescaled to have norm 5, which controls exploding gradients.
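Both ideas in this paragraph are small enough to sketch directly. The clipping function follows the rule stated in the paper (s = ||g||, and g is rescaled to 5g/s when s exceeds 5). The time_lags helper is an illustrative assumption: it pretends source and target have equal length with a monotone one-to-one alignment, and it ignores the end-of-sentence token.

```python
import math

def clip_gradient(grad_sum, batch_size=128, threshold=5.0):
    """Average the summed batch gradient, then rescale if its L2 norm exceeds threshold."""
    g = [x / batch_size for x in grad_sum]
    s = math.sqrt(sum(x * x for x in g))
    if s > threshold:
        g = [threshold * x / s for x in g]
    return g

def time_lags(n, reverse_source):
    """Steps between source word i and target word i in the concatenated sequence.

    Assumption: n source words, n target words, monotone one-to-one alignment.
    """
    lags = []
    for i in range(n):
        src_pos = (n - 1 - i) if reverse_source else i
        tgt_pos = n + i
        lags.append(tgt_pos - src_pos)
    return lags

clipped = clip_gradient([128 * 30.0, 128 * 40.0])   # averaged norm 50, rescaled to norm 5
forward = time_lags(5, reverse_source=False)        # every pair is 5 steps apart
reversed_lags = time_lags(5, reverse_source=True)   # first pair is 1 step apart
```

For n = 5 the forward lags are all 5 and the reversed lags are 1, 3, 5, 7, 9. The mean is 5 in both cases, but the minimum drops from 5 to 1, which is the paper's argument in miniature.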

The training recipe is deliberately simple: SGD without momentum, a fixed learning rate of 0.7 that is halved every half epoch after the first 5 epochs, and 7.5 epochs in total. In this era, the architectural contribution and the hardware layout are tightly coupled; the model works because it is large enough to exploit GPUs but structured enough to fit across them.
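The learning-rate schedule can be written out as a short sketch. The paper does not say whether the first halving happens at the 5-epoch mark or half an epoch later. The code below assumes the former, and the function name and half-epoch indexing are illustrative choices.

```python
def learning_rate(half_epochs_done, base_lr=0.7, constant_epochs=5):
    """Learning rate for the half epoch that starts after `half_epochs_done` half epochs.

    Assumption: the first halving is applied at the 5-epoch mark itself.
    """
    first_decay = 2 * constant_epochs          # index of the first halved half epoch
    halvings = max(0, half_epochs_done - first_decay + 1)
    return base_lr * 0.5 ** halvings

# 7.5 epochs = 15 half epochs, indexed 0..14.
schedule = [learning_rate(k) for k in range(15)]
```

Under this assumption the rate stays at 0.7 for ten half epochs and is then halved five times, ending at 0.7 / 32 for the final half epoch.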

Evidence

The data scale is WMT'14 English-to-French: 12M selected sentence pairs, 348M French words, and 304M English words. The model uses fixed vocabularies and maps out-of-vocabulary words to UNK, which limits quality but keeps the softmax at a fixed 80,000 classes.

The source-reversal evidence is concrete. Reversing source sentences drops test perplexity from 5.8 to 4.7 and raises decoded BLEU from 25.9 to 30.6 for the single LSTM comparison. Depth also pays: the authors report that each additional LSTM layer reduces perplexity by nearly 10%, which motivates spending the multi-GPU budget on four stacked layers.

Translation results show both the promise and the cost. A single forward LSTM with beam size 12 scores 26.17 BLEU, while a single reversed LSTM scores 30.59. An ensemble of five reversed LSTMs reaches 34.81 BLEU with beam size 12, exceeding the phrase-based baseline at 33.30. Rescoring the baseline 1000-best list with an ensemble of five reversed LSTMs reaches 36.5 BLEU, within 0.5 BLEU of the best WMT'14 result cited in the paper. The authors note that an ensemble of five LSTMs with beam size 2 is cheaper than a single LSTM with beam size 12, showing that inference budget already shapes evaluation choices.
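A back-of-the-envelope reading of the cost comparison is sketched below. The paper states the conclusion but not the cost model. The assumption here is that decoding cost per output step scales with the number of models times the beam size, since every ensemble member must score every hypothesis kept in the beam.

```python
def decoder_evals_per_step(num_models, beam_size):
    """Hypothesis extensions scored per output step.

    Assumption: cost is proportional to (number of models) x (beam size).
    """
    return num_models * beam_size

ensemble_beam2 = decoder_evals_per_step(num_models=5, beam_size=2)
single_beam12 = decoder_evals_per_step(num_models=1, beam_size=12)

print(ensemble_beam2, single_beam12)
```

Under this cost model the five-model ensemble with beam size 2 needs 10 decoder evaluations per step against 12 for the single model with beam size 12, consistent with the authors' remark.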

Historical Effect

Seq2seq made end-to-end neural machine translation credible at large WMT scale, but its compute profile also shows why the field moved on. It required manual GPU partitioning, sharding of a large softmax, bucketing, source reversal, and about 10 days of training on 8 GPUs. It is the canonical recurrent GPU-era translation card: enough accelerator compute to replace hand-built translation pipelines, but not yet enough parallelism to escape token-by-token recurrence.

Limits

The exact GPU model is not listed. The recurrent architecture remains sequential across tokens, limiting both training and decoding parallelism. The large-vocabulary softmax is expensive enough to need four GPUs, and the fixed-vocabulary UNK handling leaves rare words unresolved. Source reversal is an optimization workaround for dependency length, not a general alignment mechanism. Later attention models and Transformers address these limits more directly.