Neural Machine Translation by Jointly Learning to Align and Translate

Metadata

Slug: neural_machine_translation_align_2014
Year: 2014
Venue: ICLR
Authors: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
Reading status: read complete
Compute regime: Single-GPU deep learning
Primary sources: PDF, extracted text

Compute Setup

The paper lists specific GPUs and wall-clock training time in Appendix Table 2. RNNenc-30 uses TITAN BLACK for 8.46e5 updates, 6.4 epochs, and 109 hours. RNNsearch-30 also uses TITAN BLACK for 4.71e5 updates, 3.6 epochs, and 113 hours. RNNenc-50 uses Quadro K-6000 for 6.00e5 updates, 4.5 epochs, and 108 hours. RNNsearch-50 uses Quadro K-6000 for 2.88e5 updates, 2.2 epochs, and 111 hours, and the longer RNNsearch-50* run (the asterisk marks the model trained until development-set performance stopped improving) uses Quadro K-6000 for 6.67e5 updates, 5.0 epochs, and 252 hours. The main text summarizes this as approximately five days per model.
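The Table 2 figures can be cross-checked with a little arithmetic. The sketch below derives updates per hour and the implied number of sentence pairs per epoch, assuming every update consumes the 80-sentence minibatch described in this card. The derived quantities and the names RUNS, derived_stats, and STATS are illustrative and do not appear in the paper.

```python
# Figures copied from the Table 2 values quoted above:
# (model, GPU, updates, epochs, hours)
RUNS = [
    ("RNNenc-30", "TITAN BLACK", 8.46e5, 6.4, 109),
    ("RNNsearch-30", "TITAN BLACK", 4.71e5, 3.6, 113),
    ("RNNenc-50", "Quadro K-6000", 6.00e5, 4.5, 108),
    ("RNNsearch-50", "Quadro K-6000", 2.88e5, 2.2, 111),
    ("RNNsearch-50 (longer run)", "Quadro K-6000", 6.67e5, 5.0, 252),
]
BATCH_SIZE = 80  # sentences per update, as stated in this card


def derived_stats(updates, epochs, hours, batch_size=BATCH_SIZE):
    """Return (updates per hour, implied sentence pairs per epoch)."""
    updates_per_hour = updates / hours
    pairs_per_epoch = updates * batch_size / epochs
    return updates_per_hour, pairs_per_epoch


# Derived values keyed by model name; these are not reported in the paper.
STATS = {name: derived_stats(u, e, h) for name, _gpu, u, e, h in RUNS}

if __name__ == "__main__":
    for name, (per_hour, per_epoch) in STATS.items():
        print(f"{name:26s} {per_hour:7.0f} updates/h  {per_epoch / 1e6:5.2f}M pairs/epoch")
```

Under that assumption every run implies between 10M and 11M sentence pairs per epoch, so the update and epoch columns are mutually consistent. On the same GPU, the RNNsearch runs complete fewer updates per hour than the RNNenc runs, which fits the extra per-step alignment work.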

The training loop is single-GPU recurrent sequence training. Each SGD update uses a minibatch of 80 sentences with Adadelta. The WMT 2014 English-French corpus totals 850M words before data selection and is reduced to 348M words. The model uses 30,000-word source and target shortlists, 1000-unit recurrent hidden states, 620-dimensional embeddings, and a 500-unit maxout hidden layer in the deep output. This is not yet the wide data-parallel TPU/GPU setting; the dominant costs are recurrent unrolls, vocabulary softmax work, and wasted padding from uneven sentence lengths.

Bottleneck

The conceptual bottleneck is the fixed-length context vector in encoder-decoder NMT. The paper states that a neural network must compress all necessary information of a source sentence into one vector, making long sentences difficult; prior work showed encoder-decoder performance deteriorated rapidly as input length increased. This is a memory bottleneck in representation space: no matter how much information the source contains, the decoder receives one fixed summary.

The compute bottleneck is recurrent sequence length. The appendix says each update requires time proportional to the length of the longest sentence in a minibatch. That means one long example can force the GPU to process extra recurrent steps for the whole batch. Unlike later self-attention models, recurrence also prevents full parallelism across positions. RNNsearch adds per-target-step alignment over source positions, so it spends more compute to avoid the fixed-vector bottleneck, but it remains constrained by sequential decoding and sentence-length-dependent work.
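A minimal cost model makes the longest-sentence effect concrete. It assumes that every sentence in a minibatch is unrolled to the length of the longest one and that cost is counted in recurrent steps on one side of the model only. The example batch and the function names are hypothetical.

```python
def padded_steps(lengths):
    """Recurrent steps actually computed: every sentence is unrolled
    to the length of the longest sentence in the minibatch."""
    return max(lengths) * len(lengths)


def useful_steps(lengths):
    """Recurrent steps that correspond to real (non-padding) words."""
    return sum(lengths)


def waste_fraction(lengths):
    """Share of computed steps that are spent on padding."""
    return 1.0 - useful_steps(lengths) / padded_steps(lengths)


# Hypothetical minibatch: 79 ten-word sentences plus one fifty-word sentence.
batch = [10] * 79 + [50]

if __name__ == "__main__":
    print("computed steps:", padded_steps(batch))
    print("useful steps:  ", useful_steps(batch))
    print("waste fraction:", round(waste_fraction(batch), 2))
```

In this made-up batch a single fifty-word sentence pushes the cost from 800 steps to 4000, so roughly four fifths of the recurrent work lands on padding.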

Method Adaptation

The attention/alignment mechanism adapts to this GPU setting by spending extra compute on dynamic memory access rather than forcing all source information through one fixed vector. A bidirectional encoder creates a sequence of annotations. At each generated target word, the decoder computes a soft alignment over source annotations and forms a context vector as a weighted sum. The paper emphasizes that this relieves the encoder from encoding all information into one fixed-length vector; information can be spread through the sequence and selectively retrieved.
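One decoder step of this soft alignment can be sketched in plain Python. The scoring form follows the single-layer alignment model given in the paper's appendix, e_j = v_a . tanh(W_a s_prev + U_a h_j), followed by a softmax and a weighted sum. The tiny dimensions, the identity weight matrices, and the helper names are assumptions chosen for readability, not the paper's trained parameters or code.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(prev_state, annotations, W_a, U_a, v_a):
    """One decoder step of soft alignment.

    e_j     = v_a . tanh(W_a s_prev + U_a h_j)   (alignment score)
    alpha_j = softmax(e)_j                       (alignment weight)
    context = sum_j alpha_j * h_j                (context vector)
    """
    projected_state = matvec(W_a, prev_state)
    scores = []
    for h_j in annotations:
        projected_annotation = matvec(U_a, h_j)
        hidden = [math.tanh(a + b)
                  for a, b in zip(projected_state, projected_annotation)]
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(w * h[k] for w, h in zip(weights, annotations))
               for k in range(dim)]
    return weights, context


# Toy inputs (hypothetical): three 2-dimensional source annotations.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]

weights, context = additive_attention(prev_state, annotations, identity, identity, v_a)

if __name__ == "__main__":
    print("alignment weights:", [round(w, 3) for w in weights])
    print("context vector:   ", [round(c, 3) for c in context])
```

The weights are positive and sum to one, so the context vector is a convex combination of the annotations. The decoder recomputes it at every target step, which is the per-token read described above.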

The implementation also adapts the data pipeline to sequence hardware waste. Before every 20th update, it retrieves 1600 sentence pairs, sorts them by length, and splits them into 20 minibatches, minimizing wasted computation caused by the longest sentence in the batch. Gradient L2 norm is clipped to at most 1, and Adadelta adapts each parameter's learning rate. At inference, beam search spends extra test-time compute to approximate the best translation. These choices are compute-aware: length sorting improves GPU utilization, clipping stabilizes long recurrent unrolls, and beam search trades latency for BLEU.
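The length-sorting and clipping steps can be sketched as follows. The card only says the 1600 pairs are sorted by length, so the sort key used here (source length, then target length), the synthetic sentence lengths, and all function names are assumptions. Gradients are treated as flat lists of floats for simplicity.

```python
import math
import random

BATCH_SIZE = 80
CHUNK_SIZE = 1600  # 20 minibatches' worth of sentence pairs


def chunk(pairs, batch_size=BATCH_SIZE):
    """Slice pairs into consecutive minibatches without reordering."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


def length_sorted_minibatches(pairs, batch_size=BATCH_SIZE):
    """Sort a chunk by length, then slice it into minibatches."""
    # Assumed sort key: source length first, target length as tie-breaker.
    ordered = sorted(pairs, key=lambda pair: (len(pair[0]), len(pair[1])))
    return chunk(ordered, batch_size)


def padded_source_steps(batches):
    """Source-side recurrent steps if each batch is unrolled to its longest sentence."""
    return sum(max(len(src) for src, _tgt in batch) * len(batch) for batch in batches)


def clip_by_l2_norm(gradient, threshold=1.0):
    """Rescale a flat gradient so its L2 norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)
    return [g * threshold / norm for g in gradient]


# Synthetic stand-in data: sentence lengths drawn uniformly from 1..50.
rng = random.Random(0)
pairs = [(["w"] * rng.randint(1, 50), ["w"] * rng.randint(1, 50))
         for _ in range(CHUNK_SIZE)]

unsorted_batches = chunk(pairs)
sorted_batches = length_sorted_minibatches(pairs)

if __name__ == "__main__":
    print("padded source steps, arrival order:", padded_source_steps(unsorted_batches))
    print("padded source steps, length-sorted:", padded_source_steps(sorted_batches))
    print("clipped gradient:", clip_by_l2_norm([3.0, 4.0]))
```

Sorting groups sentences of similar length, so each minibatch's longest sentence is close to its typical sentence and fewer recurrent steps are spent on padding. Clipping leaves the gradient direction unchanged and only shrinks its magnitude when the norm exceeds the threshold.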

Evidence

The evidence ties translation quality to both the alignment mechanism and the five-day single-GPU training regime. Table 1 reports BLEU scores on WMT 2014 English-French: RNNencdec-30 gets 13.93 on all sentences and 24.19 on no-UNK sentences, while RNNsearch-30 gets 21.50 and 31.44. RNNencdec-50 gets 17.82 and 26.71, while RNNsearch-50 gets 26.75 and 34.16. The longer RNNsearch-50* run reaches 28.45 and 36.15. That is slightly above Moses on no-UNK sentences (36.15 versus 35.63), though Moses remains higher on all sentences (33.30 versus 28.45).

Length evidence is central. Figure 2 plots BLEU by sentence length, and the text says RNNencdec performance dramatically drops as sentence length increases, while RNNsearch-30 and RNNsearch-50 are more robust. RNNsearch-50 shows no deterioration even with sentences of length 50 or more, and RNNsearch-30 even outperforms RNNencdec-50. Qualitative examples show RNNencdec drifting after about 30 words, while RNNsearch preserves long-sentence meaning. Appendix Table 2 then grounds these improvements in training scale: the better long-run model required 252 hours on a Quadro K-6000.

Historical Effect

This paper introduced learned attention as a compute allocation mechanism for sequence models: the decoder dynamically chooses which source states to read instead of relying on a single bottleneck vector. In compute-structure terms, it converts a fixed memory interface into a content-addressed per-token read. It is a direct predecessor of the Transformer, where attention becomes the main parallelizable computation rather than an add-on around recurrent state. The card also marks a 2014 moment when single-GPU multi-day runs could train useful NMT models, but full sequence parallelism had not yet arrived.

Limits

The model is still recurrent and therefore not fully parallel across source and target positions. Attention improves memory access but adds an alignment computation evaluated across source-target position pairs; the appendix notes that the alignment model must be designed knowing it is evaluated Tx × Ty times for each sentence pair, where Tx and Ty are the source and target lengths. The vocabulary is limited to 30,000 words per language and unknown tokens remain a major quality issue. Hardware scale is also narrow: even the strongest run is one Quadro K-6000 for 252 hours, not a distributed training system. Attention solves the fixed-vector bottleneck, but not yet the accelerator-wide parallelism bottleneck that the Transformer later targets.
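The pairwise alignment cost is easy to tally. The sketch below counts operations for one sentence pair, using the appendix's observation that the annotation projection U_a h_j does not depend on the target step and can be precomputed once per source position. The function name, the dictionary keys, and the 50-by-50 example are illustrative assumptions.

```python
def alignment_cost(tx, ty):
    """Operation counts for aligning one sentence pair.

    tx: number of source positions, ty: number of target positions.
    """
    return {
        # One alignment score per (target step, source position) pair.
        "score_evaluations": tx * ty,
        # Annotation projections if recomputed at every target step.
        "annotation_projections_naive": tx * ty,
        # Annotation projections if computed once and cached.
        "annotation_projections_cached": tx,
        # One decoder-state projection per target step.
        "state_projections": ty,
    }


cost = alignment_cost(tx=50, ty=50)

if __name__ == "__main__":
    for name, count in cost.items():
        print(f"{name:32s} {count}")
```

Caching removes the pairwise factor from the projections, but the score evaluations themselves stay at one per source-target pair, which is why the alignment model is kept to a single layer.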