BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Metadata

Reading status: read complete
Year: 2022
Compute regime: Hyperscale dense LLM training (hyperscale_dense_llm_training)
PDF: 2022-bloom_2022.pdf
Extracted text: 2022-bloom_2022.txt
PDF URL: https://arxiv.org/pdf/2211.05100.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly reports the training machine. BLOOM was trained on Jean Zay at IDRIS/CNRS, using 48 training nodes with "8 NVIDIA A100 80GB GPUs" per node, for 384 active GPUs, plus four spare nodes kept ready for failures. Each node had two AMD EPYC 7543 32-core CPUs and 512 GB RAM. Storage came from a shared SpectrumScale/GPFS parallel filesystem. Inside a node, GPU communication used four NVLink GPU-to-GPU interconnects; between nodes, each node had four Omni-Path 100 Gbps links in an enhanced hypercube 8D topology.

The training run lasted about 3.5 months and consumed 1,082,990 compute hours. The final BLOOM model has 176,247M parameters, 70 layers, hidden dimension 14,336, 112 attention heads, sequence length 2,048, a 250,680-token vocabulary, and a global batch size of 2,048 sequences. The paper reports 366B total training tokens for BLOOM, starting from the roughly 341B-token ROOTS corpus and then adding about 25B repeated tokens after Chinchilla-style data-scaling results appeared during training. The final run used bfloat16 mixed precision on A100s, not the float16 used in earlier smaller variants and trials.
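
The run figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this card, plus two labeled assumptions: that every optimizer step consumed a full global batch of full-length sequences, and that the reported compute hours are GPU-hours spread evenly over the 384 active GPUs.

```python
# Back-of-envelope check of the run size, using figures quoted in this card.
GLOBAL_BATCH_SEQS = 2_048          # sequences per optimizer step
SEQ_LEN = 2_048                    # tokens per sequence
TOTAL_TOKENS = 366e9               # reported training tokens
COMPUTE_HOURS = 1_082_990          # reported compute hours
ACTIVE_GPUS = 48 * 8               # 48 nodes x 8 A100s

tokens_per_step = GLOBAL_BATCH_SEQS * SEQ_LEN
# ASSUMPTION: every step used the full global batch (ignores any warm-up).
approx_steps = TOTAL_TOKENS / tokens_per_step
# ASSUMPTION: "compute hours" are GPU-hours spread evenly over 384 GPUs.
wall_clock_days = COMPUTE_HOURS / ACTIVE_GPUS / 24

print(f"tokens per step: {tokens_per_step:,}")
print(f"approx. optimizer steps: {approx_steps:,.0f}")
print(f"implied wall-clock days: {wall_clock_days:.1f}")
```

Under these assumptions each step covers about 4.2M tokens and the reported hours correspond to roughly 117 days on 384 GPUs, the same order as the "about 3.5 months" quoted above. The two figures are rounded differently and may not count exactly the same work.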

Bottleneck

The central bottleneck is fitting and sustaining a 176B dense decoder across a public supercomputer rather than a single-owner hyperscale cluster. Parameters, gradients, optimizer state, and activations cannot fit in one 80 GB GPU, so pure data parallelism, which replicates the full model on every device, is ruled out. BLOOM therefore has a memory-placement problem and a communication problem at the same time: tensor shards need fast intra-layer exchange, pipeline stages need synchronized forward/backward flow, and data-parallel replicas still need gradient synchronization at every step.

The lower-bound arithmetic is already beyond single-device scale. BLOOM's 176.247B parameters imply about 2.82 TB of mixed-precision Adam training state at 16 bytes per parameter, before activations, temporary buffers, and fragmentation. For inference, BF16 weights are about 352.5 GB before KV cache. With 70 layers, 112 attention heads, head dimension 128, and 2048-token context, batch-1 KV cache adds about 8.2 GB. The deployment bound is therefore weights plus cache plus runtime workspace, not parameter storage alone.
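
The figures in this paragraph can be reproduced directly. The sketch below is a minimal check and not a memory profiler: it uses decimal units (1 GB = 1e9 bytes), and the split of the 16 bytes per parameter into weights, gradients, master weights, and two Adam moments is an assumed, commonly used breakdown that the card itself does not spell out.

```python
# Lower-bound memory arithmetic for BLOOM-176B, using figures from this card.
N_PARAMS = 176.247e9
N_LAYERS = 70
N_HEADS = 112
HEAD_DIM = 128
CONTEXT = 2_048

# ASSUMED 16 bytes/param breakdown for mixed-precision Adam:
# 2 (16-bit weights) + 2 (16-bit grads) + 4 (fp32 master weights)
# + 4 (Adam first moment) + 4 (Adam second moment).
BYTES_PER_PARAM_TRAIN = 2 + 2 + 4 + 4 + 4
BYTES_BF16 = 2

train_state_tb = N_PARAMS * BYTES_PER_PARAM_TRAIN / 1e12
weights_gb = N_PARAMS * BYTES_BF16 / 1e9
# KV cache: one key and one value vector per layer, head, and token.
kv_cache_gb = 2 * N_LAYERS * N_HEADS * HEAD_DIM * CONTEXT * BYTES_BF16 / 1e9

print(f"training state: {train_state_tb:.2f} TB")
print(f"BF16 weights:   {weights_gb:.1f} GB")
print(f"KV cache (batch 1, {CONTEXT} tokens): {kv_cache_gb:.1f} GB")
print(f"80 GB GPUs needed for weights alone: {weights_gb / 80:.1f}")
```

The last line makes the deployment point concrete: the weights alone occupy more than four 80 GB devices before any cache or workspace is allocated.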

The second bottleneck is numerical stability. The paper says 104B-parameter experiments on NVIDIA V100 GPUs in IEEE float16 suffered irreversible divergences, plausibly from float16's limited dynamic range. A100 support for bfloat16 changed the feasible precision regime by preserving float32-like dynamic range while retaining tensor-core throughput.
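
The dynamic-range gap between the two 16-bit formats follows from their bit layouts. The sketch below derives the largest finite value and the smallest normal value from the exponent and mantissa widths of each format (float16: 5 and 10 bits, bfloat16: 8 and 7 bits, float32: 8 and 23 bits). The helper name float_range is hypothetical and introduced only for this illustration.

```python
# Largest finite and smallest normal values of an IEEE-style binary float,
# derived from its exponent and mantissa widths.
def float_range(exp_bits: int, mantissa_bits: int) -> tuple[float, float]:
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias   # top exponent code is reserved for inf/NaN
    min_exp = 1 - bias                     # exponent code 0 is reserved for zero/subnormals
    largest = (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exp
    smallest_normal = 2.0 ** min_exp
    return largest, smallest_normal

FORMATS = {
    "float16": (5, 10),
    "bfloat16": (8, 7),
    "float32": (8, 23),
}

for name, (e, m) in FORMATS.items():
    largest, smallest = float_range(e, m)
    print(f"{name:9s} max={largest:.3e}  min_normal={smallest:.3e}")
```

float16 tops out at 65,504, while bfloat16 shares float32's exponent width and so reaches the same order of magnitude (about 3.4e38) at the cost of mantissa precision.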

The third bottleneck is bandwidth and orchestration. GPU math throughput exceeds the speed of repeatedly moving intermediate tensors to and from VRAM, so memory-bound operations such as LayerNorm, masking, softmax, bias addition, and GeLU can waste the A100s if implemented naively. At 384 GPUs, mundane runtime behavior also became a bottleneck: asynchronous CUDA launches complicated debugging and deadlocks, and large parameter groups caused excessive CPU memory allocation.
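
A rough traffic model shows why elementwise operations are memory-bound. The sketch below counts bytes moved for a bias-add followed by GeLU, run either as two separate kernels or as one combined kernel. The tensor shape is hypothetical, the 2.0e12 bytes/s bandwidth is an assumed round figure for an A100 80GB rather than a number from the paper, and the small bias vector is ignored.

```python
# Rough memory-traffic model for bias-add followed by GeLU on one activation tensor.
BYTES_PER_ELEM = 2                 # bfloat16
HBM_BYTES_PER_S = 2.0e12           # ASSUMED round figure for A100 80GB bandwidth

def elementwise_traffic_bytes(n_elems: int, n_kernels: int) -> int:
    # Each separate elementwise kernel reads the tensor once and writes it once.
    return n_kernels * 2 * n_elems * BYTES_PER_ELEM

# Hypothetical MLP activation: 2,048 tokens x (4 * 14,336) features, one sequence.
n_elems = 2_048 * 4 * 14_336

unfused = elementwise_traffic_bytes(n_elems, n_kernels=2)   # bias-add, then GeLU
fused = elementwise_traffic_bytes(n_elems, n_kernels=1)     # one combined kernel

for label, nbytes in [("unfused", unfused), ("fused", fused)]:
    micros = nbytes / HBM_BYTES_PER_S * 1e6
    print(f"{label:8s} {nbytes / 1e6:8.1f} MB moved, >= {micros:6.1f} us at assumed bandwidth")
```

In this model the arithmetic is the same in both cases and only the number of round trips through VRAM changes, so combining the two kernels halves the traffic.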

Method Adaptation

BLOOM adapts the GPT-style dense Transformer recipe to the Jean Zay topology with Megatron-DeepSpeed. Megatron-LM supplies the Transformer, tensor parallelism, and data loading; DeepSpeed supplies ZeRO, pipeline parallelism, and distributed training components. The resulting 3D parallelism combines data parallelism, tensor parallelism, and pipeline parallelism. ZeRO stage 1 shards optimizer states, which is a conservative memory win: it reduces Adam-state replication while avoiding the more intrusive parameter/gradient sharding stages.
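
The effect of ZeRO stage 1 on per-GPU memory can be sketched with simple division. The parallel degrees below (tensor 4, pipeline 12, data 8) are an assumed layout that multiplies out to the 384 active GPUs; this card does not state them. The estimate also assumes parameters split evenly over the model-parallel grid and ignores activations, buffers, and fragmentation.

```python
# Per-GPU model-state estimate under an ASSUMED 3D layout (not stated in this card).
N_PARAMS = 176.247e9
TP, PP, DP = 4, 12, 8              # assumed tensor / pipeline / data parallel degrees
assert TP * PP * DP == 384         # must match the 384 active GPUs

# ASSUMPTION: parameters split evenly over the TP x PP model-parallel grid.
params_per_gpu = N_PARAMS / (TP * PP)

weights_and_grads_gb = params_per_gpu * (2 + 2) / 1e9   # 16-bit weights + 16-bit grads
optimizer_gb = params_per_gpu * (4 + 4 + 4) / 1e9       # fp32 master copy + two Adam moments

no_zero_gb = weights_and_grads_gb + optimizer_gb
# ZeRO stage 1 shards only the optimizer state across the DP replicas.
zero1_gb = weights_and_grads_gb + optimizer_gb / DP

print(f"params per GPU: {params_per_gpu / 1e9:.2f} B")
print(f"model state without ZeRO: {no_zero_gb:.1f} GB")
print(f"model state with ZeRO-1:  {zero1_gb:.1f} GB")
```

Under these assumptions, sharding only the optimizer state cuts model state from roughly 59 GB to roughly 20 GB per GPU, which leaves room for activations on an 80 GB device without touching parameter or gradient placement.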

Several design choices are explicitly compute-shaped. The paper says MoE was rejected partly because mature, widely used GPU codebases for training them were not available on the project timeline. ALiBi positional embeddings avoided a learned position table and, in the authors' experiments, smoothed training. Embedding LayerNorm was introduced after the 104B instability work. The tokenizer vocabulary was made divisible by 128 for GPU efficiency and by 4 for tensor parallelism; the final size was 250,680 tokens. Fused CUDA kernels from Megatron-LM reduced memory traffic by combining LayerNorm, attention scaling/masking/softmax, and bias-plus-GeLU work so intermediate results could stay closer to registers rather than round-tripping through VRAM.
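
The tokenizer divisibility constraint can be checked directly. The arithmetic shows that 250,680 is a multiple of 4 but not of 128, so the 128 constraint presumably applies to a padded embedding matrix rather than to the raw token count. That reading, and the padding rule used below (round up to a multiple of 128 times an assumed tensor-parallel degree of 4), are assumptions and not statements from this card. The helper name pad_to_multiple is hypothetical.

```python
# Divisibility check for the tokenizer size quoted in this card.
VOCAB_SIZE = 250_680

def pad_to_multiple(n: int, multiple: int) -> int:
    # Round n up to the nearest multiple using ceiling division.
    return -(-n // multiple) * multiple

print("remainder mod 4:  ", VOCAB_SIZE % 4)
print("remainder mod 128:", VOCAB_SIZE % 128)

# ASSUMPTION: embedding rows are padded to a multiple of 128 * TP, with TP = 4.
TP = 4
padded = pad_to_multiple(VOCAB_SIZE, 128 * TP)
print("padded embedding rows:", padded)
print("padding rows added:   ", padded - VOCAB_SIZE)
```

Padding of this kind keeps each tensor-parallel shard of the embedding matrix a multiple of 128 rows, which is the kind of alignment GPU matrix kernels favor.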

Operational adaptation mattered too. The final system disabled asynchronous CUDA kernel launches for debuggability and deadlock avoidance, split parameter groups to reduce CPU memory spikes, saved checkpoints every three hours, and relied on spare nodes to absorb hardware failures without major throughput loss.

Evidence

The strongest compute evidence is the measured utilization: the fastest configuration reached 156 TFLOP/s per A100, half of the 312 TFLOP/s theoretical peak that the paper quotes for float32 or bfloat16 tensor compute. This is not a single-GPU benchmark; it is achieved inside a 384-GPU 3D-parallel training job with pipeline bubbles, tensor-parallel collectives, data movement, dataloader behavior, and checkpointing in the loop.
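
The utilization figure can be turned into cluster-level numbers. The sketch below uses the throughput, GPU count, parameter count, and token count quoted in this card. The training-FLOP estimate relies on the common 6 x parameters x tokens rule of thumb, which is an assumption here: it ignores attention FLOPs and any activation recomputation, so it should be read as a lower bound.

```python
# Utilization and aggregate throughput implied by the figures in this card.
ACHIEVED_TFLOPS = 156              # per GPU, fastest configuration
PEAK_TFLOPS = 312                  # peak figure quoted by the paper
ACTIVE_GPUS = 384
N_PARAMS = 176.247e9
TOTAL_TOKENS = 366e9

utilization = ACHIEVED_TFLOPS / PEAK_TFLOPS
cluster_pflops = ACHIEVED_TFLOPS * ACTIVE_GPUS / 1e3

# ASSUMPTION: the 6 * params * tokens rule of thumb for training FLOPs,
# which ignores attention FLOPs and any activation recomputation.
train_flops = 6 * N_PARAMS * TOTAL_TOKENS
ideal_days = train_flops / (cluster_pflops * 1e15) / 86_400

print(f"utilization: {utilization:.0%}")
print(f"aggregate throughput: {cluster_pflops:.1f} PFLOP/s")
print(f"estimated training FLOPs: {train_flops:.2e}")
print(f"days at the measured rate with no downtime: {ideal_days:.0f}")
```

Under these assumptions the cluster sustains about 60 PFLOP/s and the run would need roughly 75 uninterrupted days. The reported run was longer, which is plausible given downtime and work the rule of thumb leaves out, though this card does not break that gap down.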

The stability evidence is also concrete. Earlier 104B V100/float16 trials diverged irreversibly, while final A100 bfloat16 mixed-precision training "proved to solve" that instability in the authors' account. Hardware failures were routine but manageable: the paper reports one to two GPU failures per week, automatic use of backup nodes, and only sparse engineering issues. A PyTorch dataloader deadlock and disk-space issues caused 5-10 hour downtimes, but there was only one loss spike and the model recovered.

Benchmark evidence shows what this compute bought. On one-shot SuperGLUE comparisons, BLOOM-176B was ahead of OPT-175B on Ax-b, CB, WSC, and WiC and matched it elsewhere in the reported subset. On WMT14, the best one-shot prompt gave BLOOM 34.22 BLEU for English-to-French and 35.42 for French-to-English. On Flores-101, the paper reports strong one-shot translation across high-resource and high-to-mid-resource pairs, with poor results for underrepresented pairs such as Swahili-Yoruba. HumanEval performance was similar to comparable Pile-trained GPT models, while code-finetuned Codex remained much stronger.

Historical Effect

BLOOM showed that a 100B+ multilingual dense LLM could be built as an open-science supercomputing project rather than only as a closed industrial platform. Historically, its importance is not just the 176B parameter count; it exposed the full compute structure of a multilingual frontier-scale run: public grant allocation, exact A100 topology, 3D parallelism, BF16 migration, tokenizer divisibility constraints, checkpoint cadence, failure handling, carbon accounting, and post-training evaluation.

It also made the "open large model" question operational. Releasing weights and documentation matters only if other groups can understand the hardware and software decisions that made training possible. BLOOM's report therefore became a reference for later open LLM projects that needed to reason about A100 memory, parallelism, batch size, data mixtures, multilingual tokenization, and cluster reliability.

Limits

The compute recipe is still expensive and conservative. The authors excluded MoE because the GPU software path was not mature enough for the project, so BLOOM does not test whether sparse conditional compute could have delivered better quality per FLOP. Many architecture and objective choices were made at much smaller scales, and the paper explicitly notes that small-scale ablations do not always transfer cleanly to 176B training.

The run is also not a complete cost model for deployment. The carbon section estimates about 81 tons CO2e across training, equipment manufacturing, and idle cluster consumption, but API deployment is treated separately and depends on batching, request volume, implementation, and serving hardware. Finally, BLOOM is multilingual but not uniformly strong: underrepresented languages remain weak, and safety, bias, toxicity, hallucination, and calibration limitations remain visible.