Hyperscale dense LLM training

Training becomes a datacenter-scale problem, shaped by model and data parallelism, optimizer-state sharding, and compute-optimal scaling.

19 papers · Regime 5 of 10

Device/setup

Large GPU/TPU clusters with model, data, pipeline, optimizer-state, and sometimes sequence parallelism; memory and interconnect dominate.
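As a rough illustration of how these parallelism dimensions compose, the sketch below factors a cluster into tensor-, pipeline-, and data-parallel groups. The function name `parallel_layout` and the example sizes are hypothetical, and the sketch assumes the common convention that the total device count is the product of the three degrees.

```python
def parallel_layout(world_size, tensor_parallel, pipeline_parallel,
                    micro_batch, grad_accum_steps):
    """Derive the data-parallel degree and global batch size.

    Assumption: world_size = tensor_parallel * pipeline_parallel * data_parallel.
    All names and sizes here are illustrative, not taken from any one paper.
    """
    devices_per_replica = tensor_parallel * pipeline_parallel
    if world_size % devices_per_replica != 0:
        raise ValueError(
            "world_size must be divisible by tensor_parallel * pipeline_parallel"
        )
    data_parallel = world_size // devices_per_replica
    # Each data-parallel replica sees micro_batch sequences per micro-step and
    # accumulates gradients over grad_accum_steps micro-steps per optimizer step.
    global_batch = micro_batch * grad_accum_steps * data_parallel
    return {
        "data_parallel": data_parallel,
        "devices_per_replica": devices_per_replica,
        "global_batch": global_batch,
    }


layout = parallel_layout(
    world_size=1024, tensor_parallel=8, pipeline_parallel=16,
    micro_batch=1, grad_accum_steps=64,
)
print(layout)
```

Tensor parallelism is usually kept inside a node because it communicates on every layer, while pipeline and data parallelism tolerate slower links. The sketch does not model that placement.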

Bottleneck

Parameter/optimizer memory, communication, dataset scale, data-token allocation, reliability, and inference cost at frontier model size.
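To make the parameter/optimizer memory pressure concrete, here is a minimal sketch of per-device model-state memory under the accounting used in the ZeRO paper. It assumes mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state) and ignores activations and temporary buffers. The function name `model_state_gb` is hypothetical.

```python
def model_state_gb(params, data_parallel=1, zero_stage=0, k=12):
    """Per-device model-state memory in GB (1 GB = 1e9 bytes).

    Assumptions (mixed-precision Adam, as in the ZeRO paper's accounting):
      - 2 bytes/param for fp16 weights
      - 2 bytes/param for fp16 gradients
      - k = 12 bytes/param of optimizer state (fp32 weights, momentum, variance)
    Activations, buffers, and fragmentation are not counted.
    """
    weights = 2.0 * params
    grads = 2.0 * params
    optim = float(k) * params
    if zero_stage >= 1:   # stage 1: shard optimizer state
        optim /= data_parallel
    if zero_stage >= 2:   # stage 2: also shard gradients
        grads /= data_parallel
    if zero_stage >= 3:   # stage 3: also shard parameters
        weights /= data_parallel
    return (weights + grads + optim) / 1e9


for stage in range(4):
    gb = model_state_gb(7.5e9, data_parallel=64, zero_stage=stage)
    print(f"ZeRO stage {stage}: {gb:.1f} GB per device")
```

For a 7.5B-parameter model with 64-way data parallelism, this gives roughly 120, 31.4, 16.6, and 1.9 GB for stages 0 through 3, which is why unsharded data parallelism stops fitting on a single accelerator well before frontier model sizes.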

Methods that fit

GPT-style dense Transformers, Megatron tensor parallelism, ZeRO/FSDP sharding, scaling laws, Chinchilla-style compute allocation, Pathways/TPU-pod training, and open dense LLM releases all fit training on hyperscale dense clusters.
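Chinchilla-style compute allocation can be sketched with two widely used approximations: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a rule of thumb of about 20 training tokens per parameter. Both are simplifications rather than the fitted scaling laws themselves, and the function name `compute_optimal_split` is hypothetical.

```python
import math


def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a training FLOP budget into parameters N and tokens D.

    Assumptions (approximations, not exact fitted laws):
      - C ~= 6 * N * D training FLOPs
      - D = tokens_per_param * N, with ~20 as a Chinchilla-style rule of thumb
    Solving for N gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n, d = compute_optimal_split(5.88e23)
print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

With a budget of about 5.88e23 FLOPs, the split comes out near 70B parameters and 1.4T tokens, in line with the Chinchilla configuration. Changing `tokens_per_param` shows how sensitive the allocation is to the assumed ratio.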

Methods that became obsolete or less central

Pure parameter-count scaling without token-budget, inference-budget, and reliability accounting became less convincing after compute-optimal analyses showed that, at a fixed compute budget, training tokens should grow along with parameter count.

Representative papers

Rank | Year | Paper | Priority | Status
45 | 2019 | Language Models are Unsupervised Multitask Learners | 5 | downloaded / read_complete
46 | 2019 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | 5 | downloaded / read_complete
47 | 2019 | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | 5 | downloaded / read_complete
48 | 2020 | Language Models are Few-Shot Learners | 5 | downloaded / read_complete
49 | 2020 | Scaling Laws for Neural Language Models | 5 | downloaded / read_complete
50 | 2022 | Training Compute-Optimal Large Language Models | 5 | downloaded / read_complete
51 | 2022 | PaLM: Scaling Language Modeling with Pathways | 5 | downloaded / read_complete
52 | 2021 | Scaling Language Models: Methods, Analysis and Insights from Training Gopher | 4 | downloaded / read_complete
53 | 2023 | Llama 2: Open Foundation and Fine-Tuned Chat Models | 4 | downloaded / read_complete
54 | 2023 | LLaMA: Open and Efficient Foundation Language Models | 4 | downloaded / read_complete
55 | 2022 | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | 3 | downloaded / read_complete
56 | 2022 | GPT-NeoX-20B: An Open-Source Autoregressive Language Model | 3 | downloaded / read_complete
57 | 2022 | OPT: Open Pre-trained Transformer Language Models | 3 | downloaded / read_complete
58 | 2023 | Gemini: A Family of Highly Capable Multimodal Models | 3 | downloaded / read_complete
59 | 2023 | Mistral 7B | 3 | downloaded / read_complete
60 | 2023 | Textbooks Are All You Need | 3 | downloaded / read_complete
61 | 2024 | The Llama 3 Herd of Models | 3 | downloaded / read_complete
62 | 2023 | A Survey of Large Language Models | 1 | downloaded / read_complete
124 | 2025 | Qwen3 Technical Report | 4 | downloaded / read_complete

Open questions

  • How can closed frontier infrastructure claims be distinguished from reproducible open training stacks and published hardware details?
