Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Metadata

Slug: moe_2017
Year: 2017
Venue: ICLR
Authors: Noam Shazeer et al.
Reading status: read complete
Compute regime: Sparse and memory-efficient scaling
Primary sources: PDF, extracted text

Compute Setup

The paper explicitly uses GPU clusters. The 1-billion-word benchmark language-model experiments use clusters of 16-32 Tesla K40 GPUs. The 100B-word Google News experiments use 32 Tesla K40 GPUs for most models, then 64 and 128 K40 GPUs for the largest two models so the parameters fit in memory. The machine-translation experiments train synchronously on up to 64 GPUs, with each GPU processing a batch of sentence pairs containing about 16,000 words. The GNMT comparisons in the translation tables use 96 K80 GPUs, and the multilingual GNMT baseline uses 96 K20s.

The important setup detail is that the paper does not simply add parameters to a single dense GPU model. Dense recurrent layers remain conventional, while MoE experts are model-parallel shards distributed across devices. With d devices each processing its own batch, the MoE layer pools the data-parallel batches, so each expert sees roughly d times more examples than it would in a naive per-device implementation.

Bottleneck

The bottleneck is that conditional computation does not naturally match GPU hardware. The paper says modern devices, especially GPUs, are fast at arithmetic but less friendly to branching, and that large batches are critical because they amortize parameter transfers and updates. A naive MoE splits a batch of b examples across n experts, so with top-k routing each expert receives only about k*b/n examples. The result is tiny matrix multiplications and poor GPU occupancy exactly when the model is trying to scale capacity.
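
As a rough illustration of the k*b/n estimate, the sketch below compares the per-expert batch in a naive per-device layout with the batch obtained when d synchronous devices pool their examples at the MoE layer. The function name and the sample sizes are illustrative assumptions, not figures from the paper, and routing is assumed to be perfectly balanced.

```python
def expert_batch(k: int, b: int, n: int, d: int = 1) -> float:
    """Approximate examples per expert, assuming perfectly balanced routing.

    k: experts selected per example, b: examples per device,
    n: total experts, d: devices whose batches are pooled at the MoE layer.
    """
    return k * b * d / n


# Hypothetical sizes chosen only for illustration.
k, b, n, d = 4, 1024, 256, 32
naive = expert_batch(k, b, n)       # each device routes only its own batch
pooled = expert_batch(k, b, n, d)   # batches from d devices are combined
print(naive, pooled, pooled / naive)
```

With these made-up sizes the naive layout leaves each expert a handful of examples, while pooling multiplies that by the device count.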

Network bandwidth is the other bottleneck. The experts are stationary on devices, so communication mostly consists of sending inputs to the selected experts and returning outputs. The paper notes that a GPU cluster's arithmetic capacity can exceed aggregate inter-device bandwidth by thousands of times. To remain compute-limited rather than network-limited, each expert must perform enough arithmetic per byte of input/output. This explains why the experts are one-hidden-layer feed-forward networks with thousands of ReLU units rather than tiny branches.
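
The arithmetic-per-byte argument can be made concrete with a small count. The sketch below assumes a one-hidden-layer expert whose input and output have the same width, ignores biases, and counts multiply-adds against values sent and returned. The function name and the widths are hypothetical.

```python
def expert_compute_to_io_ratio(d_model: int, hidden: int) -> float:
    """Multiply-adds per value communicated for a one-hidden-layer expert.

    Assumes input and output both have d_model values and ignores biases.
    """
    multiply_adds = d_model * hidden + hidden * d_model  # two matmuls per example
    values_moved = d_model + d_model                      # input sent plus output returned
    return multiply_adds / values_moved


# Hypothetical widths: the ratio tracks the hidden-layer size.
for hidden in (64, 1024, 8192):
    print(hidden, expert_compute_to_io_ratio(512, hidden))
```

Under these assumptions the ratio reduces to the hidden-layer width, which is why widening the expert is the direct way to stay compute-limited.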

Memory also constrains the batch. Larger batches make experts efficient, but activations must be stored for backpropagation. At Google News scale, the paper introduces activation recomputation and a factored second-moment optimizer state to fit up to about 1B parameters per GPU.
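
To show why a factored second-moment state saves memory, the sketch below assumes one particular factoring: keep only the row and column averages of the squared-gradient matrix and rebuild each entry as their outer product divided by the overall mean. It illustrates the storage idea only. The moving averages and update rule of a real optimizer are omitted, and all names are hypothetical.

```python
def factored_second_moment(sq_grads):
    """Approximate a matrix of squared-gradient averages from row and column means."""
    rows, cols = len(sq_grads), len(sq_grads[0])
    row_mean = [sum(r) / cols for r in sq_grads]
    col_mean = [sum(sq_grads[i][j] for i in range(rows)) / rows for j in range(cols)]
    overall = sum(row_mean) / rows  # same value as the mean of col_mean
    # Outer product of the two vectors, rescaled by the overall mean.
    approx = [[row_mean[i] * col_mean[j] / overall for j in range(cols)]
              for i in range(rows)]
    return row_mean, col_mean, approx


# A rank-1 example, where the factored form recovers the matrix exactly.
sq = [[1.0, 3.0, 5.0], [2.0, 6.0, 10.0]]
row_mean, col_mean, approx = factored_second_moment(sq)
print(len(row_mean) + len(col_mean), "stored values instead of", len(sq) * len(sq[0]))
print(approx)
```

For an expert weight matrix with thousands of rows and columns, storing rows + cols values instead of rows * cols is what makes the optimizer state nearly free.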

Method Adaptation

The sparsely gated MoE adapts conditional computation to GPU clusters by using noisy top-k gating to route each example to only k of the n experts. In many language-model experiments k=4; in some translation MoEs k=2. The result is a layer with enormous parameter capacity whose per-token compute stays close to that of a fixed-size dense layer.
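
A minimal sketch of noisy top-k gating for a single input vector follows. It assumes the usual form: a clean logit per expert, plus Gaussian noise scaled by a softplus of a second learned projection, then a softmax over only the k largest logits. The weight layout (one vector per expert) and all names are assumptions made for illustration.

```python
import math
import random


def softplus(z: float) -> float:
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_top_k_gate(x, w_gate, w_noise, k, rng=random):
    """Return gate weights for one input; all but k entries are exactly zero."""
    n = len(w_gate)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    # Clean logit plus Gaussian noise with an input-dependent scale.
    logits = [dot(x, w_gate[i]) + rng.gauss(0.0, 1.0) * softplus(dot(x, w_noise[i]))
              for i in range(n)]
    top = sorted(range(n), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only.
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(n)]


# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
n_experts, dim, k = 8, 4, 2
w_gate = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_experts)]
w_noise = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_experts)]
gates = noisy_top_k_gate([0.5, -0.2, 0.1, 0.9], w_gate, w_noise, k, rng)
print(gates)
```

Only the experts with nonzero gate weight need to be evaluated, which is where the compute saving comes from.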

The implementation is explicitly hybrid parallel. Non-MoE layers can be replicated data-parallel, while experts are sharded model-parallel. Synchronous batches from many devices are merged at the MoE layer to make expert sub-batches large enough for efficient GPU matmuls. As the number of experts grows, the number of devices can grow in proportion, so each expert's batch size and each device's memory footprint stay roughly constant.
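
The merge step can be pictured as a dispatch table. The sketch below pools examples from every device, groups them by the experts they were routed to, and remembers where each example came from so outputs can be returned. It is a single-process illustration with a hypothetical route function, not the paper's distributed implementation.

```python
from collections import defaultdict


def dispatch(device_batches, route):
    """Pool examples from all devices and group them by selected expert.

    device_batches: one list of examples per device.
    route: hypothetical function mapping an example to a list of expert ids.
    """
    per_expert = defaultdict(list)
    for device_id, batch in enumerate(device_batches):
        for pos, example in enumerate(batch):
            for expert_id in route(example):
                # Keep the origin so the expert output can be sent back.
                per_expert[expert_id].append((device_id, pos, example))
    return per_expert


# Two devices, four examples each, four experts, two experts per example.
device_batches = [[0, 1, 2, 3], [4, 5, 6, 7]]
route = lambda ex: [ex % 2, 2 + ex % 2]
groups = dispatch(device_batches, route)
print({expert: len(items) for expert, items in sorted(groups.items())})
```

Each expert's sub-batch here draws from both devices, which is the pooling that keeps expert matmuls large.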

The gating losses are hardware-motivated as much as statistical. If the gate collapses onto a few experts, those devices overload, expert batches become imbalanced, and most capacity is wasted. The importance and load-balancing losses keep examples spread across experts. In some translation experiments, the authors go further with a batchwise mask so every expert receives the same batch size during training, then approximate that behavior at inference using per-expert thresholds.
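
One way to express the importance term is as the squared coefficient of variation of the total gate weight each expert receives over a batch. The sketch below assumes that form, uses population variance, and adds a small epsilon for stability. The weight parameter is a hypothetical hand-tuned scale, and the separate load-balancing term is not shown.

```python
def cv_squared(values):
    """Squared coefficient of variation (population variance over squared mean)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return var / (mean ** 2 + 1e-10)


def importance_loss(gates_per_example, weight=1.0):
    """Penalize uneven total gate weight across experts within one batch."""
    n = len(gates_per_example[0])
    importance = [sum(g[i] for g in gates_per_example) for i in range(n)]
    return weight * cv_squared(importance)


# Made-up gate vectors for two examples over four experts.
balanced = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
collapsed = [[0.5, 0.5, 0.0, 0.0], [0.6, 0.4, 0.0, 0.0]]
print(importance_loss(balanced), importance_loss(collapsed))
```

The balanced batch incurs no penalty, while the batch that sends everything to the first two experts is penalized.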

Evidence

The paper's headline claim is greater than 1000x improvement in model capacity with only minor losses in computational efficiency on modern GPU clusters. The measured efficiency evidence is mixed but concrete: low-computation MoE models run at 0.74-0.90 TFLOPS/GPU, the highest-computation MoE reaches 1.56 TFLOPS/GPU, and the largest Google News model falls to 0.30 TFLOPS/GPU when batch size is not increased proportionally with GPU count. That last result is useful because it exposes the batch-size bottleneck rather than hiding it.

On the 100B-word Google News dataset, the 68.9B-parameter MoE-65536-h model reaches 28.9 test perplexity after one epoch at about 0.72 observed TFLOPS/GPU, compared with 47.0 for the 4xLSTM-512 baseline. The 137.7B-parameter MoE-131072-h model worsens slightly to 29.2 and drops to 0.30 TFLOPS/GPU, suggesting too much sparsity or too-small expert batches at that scale.

On WMT14 English-French, a 2048-expert MoE with 8.7B total parameters reaches BLEU 40.56 after 6 days on 64 K40s, compared with GNMT+RL at 39.92 after 6 days on 96 K80s. On WMT14 English-German, the same scale reaches BLEU 26.03 after 1 day on 64 K40s, versus GNMT at 24.91 on 96 K80s. In multilingual translation, the MoE model trains 12 days on 64 K40s, has 8.7B parameters and 102M ops/timestep, and beats the multilingual GNMT baseline on 11 of 12 language pairs.

Historical Effect

This paper made conditional compute practical on GPU clusters and opened the sparse-capacity branch: increase parameter count without increasing per-token FLOPs proportionally. Its historical contribution is not just "use experts," but the specific compute recipe that made experts trainable: large shared batches, expert sharding, load balancing, and communication-aware expert size.

Later MoE systems inherit this structure. What connects this paper to GShard, Switch Transformer, and sparse LLMs is the problem of scaling capacity under memory and bandwidth constraints. The core promise is that parameter memory can scale with devices while per-token arithmetic grows much more slowly.

Limits

Efficiency falls at extreme sparsity. The largest Google News model has more parameters, but worse perplexity and much lower TFLOPS/GPU than the 68.9B model. The paper attributes part of the low efficiency to not increasing batch size proportionally with the number of GPUs, which reinforces that MoE scaling depends on large enough expert sub-batches.

The method also needs careful load balancing and a communication-friendly expert shape. If experts are too small, network traffic dominates; if routing collapses, some devices overload and others idle. Inference is not free either: batchwise training masks require approximation when large training batches are unavailable at serving time. The paper demonstrates sparse capacity, but not a universal replacement for dense scaling.