Hyperscale dense LLM training
Training becomes a datacenter-scale problem shaped by model and data parallelism, optimizer-state sharding, and compute-optimal scaling.
Device/setup
Large GPU/TPU clusters with model, data, pipeline, optimizer-state, and sometimes sequence parallelism; memory and interconnect dominate.
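As a rough illustration of how these parallelism dimensions combine, the sketch below enumerates ways to factor a cluster into data-, tensor-, and pipeline-parallel degrees. The function name `parallel_layouts` and the rule that tensor parallelism stays within a single node are assumptions made for this example, not requirements of any particular framework. Optimizer-state and sequence parallelism are left out for brevity.

```python
def parallel_layouts(world_size: int, gpus_per_node: int) -> list[tuple[int, int, int]]:
    """Enumerate (data, tensor, pipeline) degrees whose product is world_size.

    Assumption for this sketch: tensor parallelism is kept inside one node,
    so its degree must divide gpus_per_node.
    """
    layouts = []
    for tp in range(1, gpus_per_node + 1):
        # Skip tensor-parallel degrees that do not tile a node or the cluster.
        if gpus_per_node % tp or world_size % tp:
            continue
        rest = world_size // tp
        for pp in range(1, rest + 1):
            if rest % pp == 0:
                dp = rest // pp  # remaining devices become data-parallel replicas
                layouts.append((dp, tp, pp))
    return layouts


if __name__ == "__main__":
    for dp, tp, pp in parallel_layouts(world_size=16, gpus_per_node=8):
        print(f"data={dp:2d} tensor={tp} pipeline={pp:2d}")
```

Which layout is best depends on model size, memory per device, and interconnect topology. The enumeration only shows the search space.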
Bottleneck
Parameter and optimizer-state memory, communication, dataset scale, the split of compute between parameters and training tokens, reliability, and inference cost at frontier model size.
Methods that fit
GPT-style dense Transformers, Megatron tensor parallelism, ZeRO/FSDP, scaling laws, Chinchilla-style compute allocation, Pathways/TPU-pod training, and open dense LLM releases all fit hyperscale dense clusters.
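To make the memory argument for ZeRO-style sharding concrete, here is a minimal sketch of per-device model-state memory. It assumes mixed-precision Adam with the commonly used accounting of 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state (master weights, momentum, variance). Activations, buffers, and fragmentation are ignored. The function name `zero_model_state_gb` is hypothetical.

```python
def zero_model_state_gb(n_params: float, n_devices: int, stage: int) -> float:
    """Approximate per-device model-state memory in GB (1 GB = 1e9 bytes).

    Assumed byte counts per parameter for mixed-precision Adam:
    2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 optimizer state).
    stage 0 = no sharding; stages 1..3 shard progressively more state.
    """
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= n_devices   # stage 1: shard optimizer state
    if stage >= 2:
        grads /= n_devices   # stage 2: also shard gradients
    if stage >= 3:
        params /= n_devices  # stage 3: also shard parameters
    return n_params * (params + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_model_state_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per device")
```

Under these assumptions, a 7.5B-parameter model on 64 devices needs about 120 GB of model state per device without sharding and under 2 GB with full sharding. The saving is paid for in extra communication.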
Methods that became obsolete or less central
Scaling parameter count alone, without accounting for token budget, inference budget, and reliability, became less convincing after compute-optimal analyses.
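The compute-optimal argument can be sketched with two widely used approximations. Both are assumptions here, not exact results: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a Chinchilla-style rule of thumb of roughly 20 training tokens per parameter. The function name `compute_optimal` and the default ratio are illustrative.

```python
import math


def compute_optimal(flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a training FLOP budget into parameters N and tokens D.

    Assumptions (rules of thumb, not exact laws):
      C ~= 6 * N * D             training FLOPs for a dense Transformer
      D ~= tokens_per_param * N  Chinchilla-style allocation
    Solving both gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    n, d = compute_optimal(5.76e23)
    print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")
```

With a budget of 5.76e23 FLOPs, this gives roughly 69B parameters and 1.4T tokens. It leaves out inference cost and reliability, which can justify training a smaller model for longer than this allocation suggests.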
Representative papers
Open questions
- Distinguish closed frontier infrastructure claims from reproducible open training stacks and published hardware details.