GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Metadata

Reading status: read complete
Year: 2021
Compute regime: Sparse and memory-efficient scaling (sparse_memory_efficient_scaling)
PDF: 2021-glam_2021.pdf
Extracted text: 2021-glam_2021.txt
PDF URL: https://arxiv.org/pdf/2112.06905.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper gives an unusually concrete training setup. The largest model, GLaM 64B/64E, was trained on 1,024 Cloud TPU-v4 chips, with large-model weights and computation partitioned by a 2D sharding algorithm implemented through GSPMD. Training uses float32 model weights and bfloat16 activations, so fitting the model onto the accelerators is a matter of precision choices and compiler-managed layout, not just of adding chips. The optimizer is Adafactor, the sequence length is 1024 tokens, and examples are packed so a batch can contain up to 1 million tokens. The largest row in the architecture table has 1.2T total parameters, 96.6B activated parameters per token, 64 Transformer layers, and 64 experts.
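As a quick sanity check on the batch figures above, the stated setup can be written down as a small record with two derived quantities. This is only a sketch: the class and field names are illustrative, "up to 1 million tokens" is taken as exactly 1,000,000 (an assumption), and the per-chip number is a plain average, not the actual 2D sharding layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlamTrainingSetup:
    """Values as stated on this card for GLaM 64B/64E; names are illustrative."""

    tpu_v4_chips: int = 1024
    seq_len_tokens: int = 1024
    max_batch_tokens: int = 1_000_000  # assumption: "up to 1 million" read literally
    weight_dtype: str = "float32"
    activation_dtype: str = "bfloat16"
    optimizer: str = "Adafactor"

    def max_packed_sequences(self) -> int:
        # Upper bound on full-length packed sequences in one batch.
        return self.max_batch_tokens // self.seq_len_tokens

    def mean_tokens_per_chip(self) -> float:
        # Simple average over chips; says nothing about how GSPMD shards the batch.
        return self.max_batch_tokens / self.tpu_v4_chips


setup = GlamTrainingSetup()
print(setup.max_packed_sequences())            # 976
print(round(setup.mean_tokens_per_chip(), 1))  # 976.6
```

If the paper's "1 million" is really 2^20 tokens, both numbers become 1024; the card does not settle which reading is intended.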

The energy numbers are also stated rather than inferred. The full 600B-token GLaM 64B/64E training run is reported at 456 MWh, compared with 1287 MWh for GPT-3. To reach scores similar to or slightly better than GPT-3's, the paper reports 574 hours on 1,024 TPU-v4 chips for 280B tokens, consuming 213 MWh. No GPU or inference hardware is specified; the concrete device claims are TPU-v4 training plus GSPMD partitioning.
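The reported figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above; the "implied power per chip" is a derived average that folds in whatever overheads the paper's energy accounting includes, so it should not be read as a chip power rating.

```python
# Figures as quoted on this card.
GLAM_FULL_RUN_MWH = 456.0   # 600B tokens
GLAM_280B_RUN_MWH = 213.0   # 280B tokens
GPT3_MWH = 1287.0
HOURS_280B = 574.0
CHIPS = 1024


def ratio(numerator: float, denominator: float) -> float:
    return numerator / denominator


# Energy relative to GPT-3 for both GLaM runs.
full_run_vs_gpt3 = ratio(GLAM_FULL_RUN_MWH, GPT3_MWH)   # roughly 0.35
short_run_vs_gpt3 = ratio(GLAM_280B_RUN_MWH, GPT3_MWH)  # roughly 0.17

# Chip-hours and the average power they imply for the 280B-token run.
chip_hours = HOURS_280B * CHIPS
implied_watts_per_chip = GLAM_280B_RUN_MWH * 1e6 / chip_hours  # about 362 W

print(f"full run / GPT-3:  {full_run_vs_gpt3:.3f}")
print(f"280B run / GPT-3:  {short_run_vs_gpt3:.3f}")
print(f"chip-hours:        {chip_hours:,.0f}")
print(f"implied W per chip: {implied_watts_per_chip:.0f}")
```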

Bottleneck

The bottleneck is dense scaling's coupling of memory capacity, training compute, and serving cost. A dense 175B-parameter model like GPT-3 must touch the same parameter set for every token, so increasing total capacity directly increases FLOPs per token. GLaM targets the case where quality benefits from a much larger parameter reservoir, but latency and energy are governed by the active subnetwork per token.

A second bottleneck is distributed systems utilization. Sparse MoE moves the hard part into routing, expert load balancing, and partitioning expert weights across a TPU device mesh. If tokens cluster on the same experts, devices overload and batch utilization falls; if experts are large, the expert weights themselves must be sharded. GLaM is therefore a compute-structure paper: it asks how to spend trillion-parameter memory without paying trillion-parameter FLOPs on every prediction.
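The utilization problem can be made concrete with a toy count of how many tokens exceed a per-expert budget. This is a minimal sketch, not the paper's mechanism: the fixed `capacity` per expert is an assumption borrowed from common MoE implementations, and the function names are hypothetical.

```python
from collections import Counter


def expert_load(assignments, num_experts):
    """Number of tokens routed to each expert."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def overflow_tokens(assignments, num_experts, capacity):
    """Tokens that exceed the (assumed) per-expert capacity."""
    return sum(max(0, load - capacity) for load in expert_load(assignments, num_experts))


NUM_EXPERTS = 4
CAPACITY = 4  # illustrative budget: 16 tokens / 4 experts

balanced = [t % NUM_EXPERTS for t in range(16)]    # 4 tokens per expert
skewed = [0] * 10 + [1] * 2 + [2] * 2 + [3] * 2    # tokens cluster on expert 0

print(expert_load(balanced, NUM_EXPERTS), overflow_tokens(balanced, NUM_EXPERTS, CAPACITY))
print(expert_load(skewed, NUM_EXPERTS), overflow_tokens(skewed, NUM_EXPERTS, CAPACITY))
```

In the skewed case one expert is over budget while the other three sit half idle, which is the device-imbalance pattern described above.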

Method Adaptation

GLaM adapts a decoder-only Transformer by replacing every other feed-forward layer with a sparsely activated MoE layer. Each token is routed to the top two experts out of 64, and only those expert feed-forward networks are evaluated for that token. This preserves a dense Transformer backbone while making the expensive feed-forward path conditional on the token. The result is a model with 1.2T stored parameters but about 8% of them active per token.
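The routing step described above can be sketched for a single token in plain Python. This is an illustration under stated assumptions, not the paper's implementation: the softmax router, the renormalization of the two selected weights, and the toy "experts" (simple scaling functions) are all choices made here for clarity.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for one token."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = probs[ranked[0]] + probs[ranked[1]]  # assumption: renormalize over the top 2
    return [(i, probs[i] / norm) for i in ranked]


def moe_layer(token, router_logits, experts):
    """Evaluate only the two selected experts and mix their outputs."""
    out = [0.0] * len(token)
    for idx, weight in top2_route(router_logits):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Toy experts: expert s multiplies the token vector by s (hypothetical stand-ins
# for feed-forward networks).
experts = [lambda x, s=s: [s * v for v in x] for s in range(4)]
logits = [0.0, 2.0, 1.0, -1.0]

routes = top2_route(logits)
print(routes)                                # experts 1 and 2, weights near 0.731 and 0.269
print(moe_layer([1.0, 2.0], logits, experts))
```

With 64 experts instead of 4, the loop still runs exactly twice per token, which is why stored parameters and per-token compute come apart.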

The method is tuned to the memory and parallelism constraints stated in the source. Top-2 routing is an explicit quality/efficiency tradeoff: more selected experts would raise per-token FLOPs and make the network denser, while fewer selected experts would reduce compute but may lose modeling quality. The auxiliary MoE loss from GShard encourages the gating function to distribute tokens evenly across experts, which is a direct defense against device imbalance. The 2D sharding approach partitions weights and computation across batch, model, and expert dimensions, allowing large experts to span multiple TPU cores without duplicating all data or compute. Bfloat16 activations lower activation bandwidth and memory pressure, while float32 weights keep the large sparse model numerically conservative.
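To show how an auxiliary loss can push the router toward even token distribution, here is a sketch of one common load-balancing formulation: the number of experts times the sum over experts of (fraction of tokens whose top choice is that expert) times (mean router probability for that expert). This is an assumption for illustration; the card does not give the exact form or scaling used in GShard or GLaM, and the function name is hypothetical.

```python
def load_balance_loss(router_probs):
    """router_probs: list of per-token probability rows, one entry per expert."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    top1_fraction = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for row in router_probs:
        top1 = max(range(num_experts), key=lambda e: row[e])
        top1_fraction[top1] += 1.0 / num_tokens
        for e, p in enumerate(row):
            mean_prob[e] += p / num_tokens
    # Equals 1.0 under perfectly even routing; grows as routing collapses.
    return num_experts * sum(f * p for f, p in zip(top1_fraction, mean_prob))


# Four tokens, four experts: each token prefers a different expert.
even = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# All four tokens prefer expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

print(load_balance_loss(even))       # close to 1.0
print(load_balance_loss(collapsed))  # close to 2.8
```

Minimizing a term like this alongside the language-modeling loss penalizes the collapsed pattern, which is the behavior the paragraph above attributes to the auxiliary loss.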

Evidence

The headline evidence is the GPT-3 comparison table. GLaM uses 180 GFLOPs per token versus 350 GFLOPs per token for GPT-3, a 48.6% reduction, while the paper reports better average zero-shot accuracy across the benchmark suite shown in that table. The same table reports 456 MWh train energy for GLaM versus 1287 MWh for GPT-3. The architecture table gives the mechanism behind those numbers: 1.2T total parameters but only 96.6B active parameters per token.
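The percentages in this comparison follow directly from the quoted figures. A short check, treating "1.2T" as exactly 1,200B parameters (an assumption, since the card gives only the rounded total):

```python
GLAM_GFLOPS_PER_TOKEN = 180.0
GPT3_GFLOPS_PER_TOKEN = 350.0
TOTAL_PARAMS_B = 1200.0   # "1.2T", rounded on the card
ACTIVE_PARAMS_B = 96.6


def relative_reduction(new: float, baseline: float) -> float:
    """Fractional saving of `new` relative to `baseline`."""
    return 1.0 - new / baseline


flops_reduction = relative_reduction(GLAM_GFLOPS_PER_TOKEN, GPT3_GFLOPS_PER_TOKEN)
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"FLOPs per token reduction: {flops_reduction:.1%}")  # 48.6%
print(f"active parameter fraction: {active_fraction:.2%}")  # 8.05%
```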

The scaling experiments support the compute argument rather than only the final score. For fixed effective FLOPs per prediction, adding more experts generally improves performance, suggesting that sparse memory capacity can buy quality without proportionally increasing active FLOPs. The paper also compares dense and MoE GLaM variants trained on the same data and shows sparse models achieving stronger NLG and NLU curves for similar effective compute. The authors emphasize that the 280B-token GLaM run can reach or slightly exceed GPT-3-level scores at 213 MWh, which ties benchmark quality to a measured device-energy claim.

Historical Effect

GLaM is a major sparse-scaling waypoint between GShard/Switch-style MoE research and later production-grade MoE LLMs. It made a strong version of the sparse-scaling proposition concrete: total parameter count can grow into the trillion range while active inference compute stays closer to a much smaller dense model. It also helped move the evaluation conversation from parameter count alone to active parameters, FLOPs per token, and training energy.

Historically, this card belongs in the sparse and memory-efficient scaling regime because its central claim is not that the TPU cluster is larger, but that the architecture uses the cluster differently: dense scaling spends memory and compute together, while GLaM separates them through sparse activation and compiler-supported sharding.

Limits

The limits are mostly systems limits. Serving GLaM still requires storing and moving a 1.2T-parameter model, even if each token activates only 96.6B parameters. MoE introduces routing decisions, auxiliary balancing losses, expert-capacity edge cases, and cross-device communication paths that dense Transformers avoid. The paper's strongest hardware evidence is for TPU-v4 plus GSPMD; portability to other interconnects or serving stacks is not established in the card sources.
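The storage side of this limit is easy to estimate. The sketch below counts weight bytes only, for the parameter counts quoted on this card; the bfloat16 row is a hypothetical serving precision (the card states float32 weights for training and gives no serving format), and optimizer state, activations, and caches are ignored.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

TOTAL_PARAMS = 1.2e12    # stored parameters (rounded figure from the card)
ACTIVE_PARAMS = 96.6e9   # activated per token


def weight_terabytes(num_params: float, dtype: str) -> float:
    """Weight storage in decimal terabytes (1 TB = 1e12 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e12


stored_fp32_tb = weight_terabytes(TOTAL_PARAMS, "float32")    # about 4.8 TB
stored_bf16_tb = weight_terabytes(TOTAL_PARAMS, "bfloat16")   # about 2.4 TB, hypothetical
active_fp32_tb = weight_terabytes(ACTIVE_PARAMS, "float32")   # about 0.39 TB

print(f"stored, float32:   {stored_fp32_tb:.2f} TB")
print(f"stored, bfloat16:  {stored_bf16_tb:.2f} TB")
print(f"active per token, float32: {active_fp32_tb:.2f} TB")
```

The gap between the stored and active rows is the point of the paragraph above: sparse activation cuts compute per token, not the memory that must be provisioned.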

The comparison to GPT-3 is not a same-lab rerun; it relies on published GPT-3 results and on energy estimates from Patterson et al. Still, within the paper's evidence, sparse activation converts accelerator memory capacity into quality without activating all parameters for every token.