The Llama 3 Herd of Models

Compute Setup

The paper describes its hardware in unusual detail. Llama 3 includes 8B, 70B, and 405B models; the flagship 405B pretraining run uses up to 16K NVIDIA H100 GPUs, each running at 700W TDP with 80GB of HBM3, on Meta's Grand Teton AI server platform. Each server has 8 GPUs and 2 CPUs, and the 8 GPUs inside a server are connected by NVLink. The 405B run uses a RoCE fabric based on Arista 7800 and Minipack2 OCP switches; smaller Llama 3 models use NVIDIA Quantum2 InfiniBand. Both fabrics provide 400 Gbps interconnects between GPUs.
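
The hardware figures above imply a rough fleet size and power envelope. The following is a minimal sketch, assuming "16K" means 16,384 GPUs and counting only GPU TDP (CPUs, switches, and cooling are ignored):

```python
# Back-of-envelope fleet arithmetic from the hardware figures above.
# Assumptions: "16K" means 16,384 GPUs; power counts GPU TDP only.
GPUS = 16_384
GPUS_PER_SERVER = 8
GPU_TDP_WATTS = 700
HBM_PER_GPU_GB = 80

servers = GPUS // GPUS_PER_SERVER
gpu_power_mw = GPUS * GPU_TDP_WATTS / 1e6      # megawatts
total_hbm_tb = GPUS * HBM_PER_GPU_GB / 1000    # decimal terabytes

print(f"servers: {servers}")
print(f"GPU power at TDP: {gpu_power_mw:.2f} MW")
print(f"aggregate HBM3: {total_hbm_tb:.1f} TB")
```

Under these assumptions the run spans about two thousand servers, which is why the inter-server fabric matters as much as NVLink inside each server.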

The surrounding infrastructure is part of the compute setup, not incidental. The RoCE AI cluster contains 24K GPUs in a three-layer Clos network, but the paper notes that only up to 16K of those GPUs are used for Llama 3 pretraining. Storage uses Meta's Tectonic distributed file system with 240PB across 7,500 SSD servers, 2 TB/s sustained throughput, and 7 TB/s peak throughput. Checkpoints save each GPU's model state, ranging from 1MB to 4GB per GPU. The model is pretrained on 15.6T text tokens with an initial 8K context stage, followed by long-context pretraining to 128K.
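
The checkpoint and storage figures can be combined into a rough bound on how long one full-fleet checkpoint write might take. This is only a sketch under a hypothetical worst case: every one of 16,384 GPUs writes the maximum 4GB shard at the same time, and the job gets the whole storage bandwidth to itself.

```python
# Rough bound on a synchronous full-fleet checkpoint write.
# Assumptions (hypothetical worst case): every GPU writes the maximum
# 4 GB shard at once, and the job gets the entire storage bandwidth.
GPUS = 16_384
MAX_SHARD_GB = 4
SUSTAINED_TB_PER_S = 2
PEAK_TB_PER_S = 7


def write_seconds(total_tb: float, throughput_tb_per_s: float) -> float:
    """Time to write total_tb terabytes at the given throughput."""
    return total_tb / throughput_tb_per_s


burst_tb = GPUS * MAX_SHARD_GB / 1000  # decimal terabytes
print(f"burst size: {burst_tb:.1f} TB")
print(f"at sustained rate: {write_seconds(burst_tb, SUSTAINED_TB_PER_S):.1f} s")
print(f"at peak rate: {write_seconds(burst_tb, PEAK_TB_PER_S):.1f} s")
```

Even in this idealized case the write takes on the order of tens of seconds at the sustained rate, so checkpoint stalls are worth engineering around.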

Bottleneck

The bottleneck is dense-model training at a scale where every subsystem becomes visible. A 405B parameter model cannot fit or train efficiently through simple data parallelism; parameters, optimizer state, gradients, activations, and KV/context-related tensors must be split across thousands of H100s. Communication has to respect the network hierarchy: bandwidth is highest and latency lowest inside a server over NVLink, both are worse across nodes, and worse still across aggregation layers. Storage has to absorb bursty checkpoint writes without pausing 16K GPUs for too long.

The raw bound is large even before long-context effects. A 405B dense model has a mixed-precision Adam training-state lower bound of about 6.48 TB at 16 bytes per parameter, before activations, temporary buffers, rematerialization overhead, and fragmentation. For BF16 inference, the weights alone are about 810 GB. With 126 layers, 8 KV heads, head dimension 128, and a 4096-token prompt, batch-1 KV cache is about 2.1 GB; at 128K context it is about 67.6 GB. This is why the serving section cannot be read as "parameter count only": BF16 weights already exceed one 8-H100 server, and long-context KV cache consumes another accelerator-scale slice of memory before batching.
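
The memory figures above follow from simple arithmetic. The sketch below reproduces them in decimal units. The split of the 16 bytes per parameter (BF16 weights and gradients plus FP32 master weights and two Adam moments) is a common accounting and is assumed here, not taken from the paper, and 128K is taken to mean 131,072 tokens.

```python
# Memory arithmetic for a 405B dense model (decimal GB/TB).
PARAMS = 405e9
LAYERS, KV_HEADS, HEAD_DIM = 126, 8, 128
BF16_BYTES = 2


def training_state_tb(params: float, bytes_per_param: int = 16) -> float:
    """Lower bound on weights + gradients + Adam state, in TB.

    Assumed split of 16 bytes: 2 (BF16 weights) + 2 (BF16 grads)
    + 4 (FP32 master weights) + 4 + 4 (Adam first and second moments).
    """
    return params * bytes_per_param / 1e12


def bf16_weights_gb(params: float) -> float:
    """Size of the BF16 weights alone, in GB."""
    return params * BF16_BYTES / 1e9


def kv_cache_gb(tokens: int, batch: int = 1) -> float:
    """BF16 KV cache: keys and values for every layer and KV head."""
    bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BF16_BYTES
    return batch * tokens * bytes_per_token / 1e9


print(f"training state: {training_state_tb(PARAMS):.2f} TB")
print(f"BF16 weights:   {bf16_weights_gb(PARAMS):.0f} GB")
print(f"KV cache @4K:   {kv_cache_gb(4096):.2f} GB")
print(f"KV cache @128K: {kv_cache_gb(131_072):.1f} GB")
```

For comparison, one 8-GPU server holds 8 x 80 GB = 640 GB of HBM3, which is less than the 810 GB of BF16 weights.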

Reliability is a compute bottleneck too. During a 54-day snapshot of 405B pretraining, the paper reports 466 job interruptions: 47 planned and 419 unexpected. About 78% of unexpected interruptions are attributed to confirmed or suspected hardware issues. A single slow straggler can slow thousands of GPUs, and the paper even reports a 1-2% diurnal throughput variation due to temperature effects on GPU dynamic voltage and frequency scaling. This is fleet engineering as much as model training.
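
The interruption counts translate into an average time between interruptions. The sketch below is plain arithmetic on the reported figures; treating interruptions as evenly spread over the 54 days is a simplifying assumption, not something the paper claims.

```python
# Interruption arithmetic for the 54-day snapshot.
# Assumption: interruptions are averaged uniformly over the window.
DAYS = 54
TOTAL, PLANNED, UNEXPECTED = 466, 47, 419
HARDWARE_SHARE = 0.78  # approximate share of unexpected interruptions

assert PLANNED + UNEXPECTED == TOTAL

hours_between_any = DAYS * 24 / TOTAL
hours_between_unexpected = DAYS * 24 / UNEXPECTED
hardware_related = round(UNEXPECTED * HARDWARE_SHARE)

print(f"mean hours between interruptions: {hours_between_any:.2f}")
print(f"mean hours between unexpected ones: {hours_between_unexpected:.2f}")
print(f"approx. hardware-related interruptions: {hardware_related}")
```

On average the job is interrupted roughly every three hours, which puts the fast restart and checkpointing tooling in context.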

Method Adaptation

The main adaptation is 4D parallelism in the order [TP, CP, PP, DP]. Tensor parallelism splits individual weight tensors and is placed closest to NVLink because it requires high bandwidth and low latency. Pipeline parallelism partitions layers into stages; the paper modifies the schedule so the number of contiguous microbatches is tunable, balances stages by moving layers away from the first and last stages, uses interleaving to reduce bubbles, and adopts asynchronous point-to-point communication. Context parallelism splits long sequences and is used for the 128K stage. Data parallelism is implemented with FSDP as the outermost dimension because it can tolerate higher network latency through asynchronous prefetching and gradient reduction.
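
One way to picture the [TP, CP, PP, DP] ordering is as a mapping from a global GPU rank to four coordinates, with TP varying fastest. The sketch below is an illustration, not the paper's implementation: the function names are hypothetical, and it assumes ranks are assigned contiguously so that consecutive ranks share a server.

```python
from typing import Dict

# Innermost (fastest-varying) dimension first, matching [TP, CP, PP, DP].
ORDER = ("tp", "cp", "pp", "dp")


def rank_to_coords(rank: int, sizes: Dict[str, int]) -> Dict[str, int]:
    """Decompose a global rank into (tp, cp, pp, dp) coordinates."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    coords = {}
    for name in ORDER:
        coords[name] = rank % sizes[name]
        rank //= sizes[name]
    if rank != 0:
        raise ValueError("rank exceeds the product of the dimension sizes")
    return coords


def coords_to_rank(coords: Dict[str, int], sizes: Dict[str, int]) -> int:
    """Inverse of rank_to_coords."""
    rank, stride = 0, 1
    for name in ORDER:
        rank += coords[name] * stride
        stride *= sizes[name]
    return rank


# Example sizes: TP=8, CP=1, PP=16, DP=64 (8,192 GPUs in total).
sizes = {"tp": 8, "cp": 1, "pp": 16, "dp": 64}
print(rank_to_coords(0, sizes))
print(rank_to_coords(7, sizes))
print(rank_to_coords(8, sizes))
```

With TP=8 and 8 GPUs per server, ranks 0 through 7 differ only in their TP coordinate, so under the contiguous-rank assumption each tensor-parallel group stays inside one NVLink domain, while DP neighbors are the farthest apart.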

The architecture also reflects inference and memory constraints. Llama 3 uses grouped-query attention with 8 key-value heads, reducing KV tensor size and making the context-parallel all-gather less costly. The tokenizer uses a 128K vocabulary, and the 405B model uses 126 layers with a token representation dimension of 16,384. Training starts with smaller batches and shorter sequences for stability: a batch of 4M tokens at sequence length 4,096, doubled to 8M tokens at sequence length 8,192 after 252M tokens of pretraining, and doubled again to 16M tokens after 2.87T tokens. Long-context training is delayed because self-attention compute grows quadratically with sequence length.
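
The batch-size ramp can be written as a small lookup. This is a sketch with hypothetical names; it assumes "M" in the batch sizes means 2**20 tokens, that the 252M and 2.87T thresholds are decimal, and that the sequence length stays at 8,192 in the third stage.

```python
from typing import Tuple

MI = 2**20  # assumption: "4M tokens" means 4 * 2**20 tokens

# (tokens_seen_threshold, tokens_per_batch, sequence_length)
SCHEDULE = [
    (0, 4 * MI, 4096),
    (252_000_000, 8 * MI, 8192),
    (2_870_000_000_000, 16 * MI, 8192),  # assumed: sequence length unchanged
]


def batch_config(tokens_seen: int) -> Tuple[int, int]:
    """Return (tokens_per_batch, sequence_length) for a training position."""
    tokens_per_batch, seq_len = SCHEDULE[0][1], SCHEDULE[0][2]
    for threshold, batch, seq in SCHEDULE:
        if tokens_seen >= threshold:
            tokens_per_batch, seq_len = batch, seq
    return tokens_per_batch, seq_len


for position in (0, 10**9, 3 * 10**12):
    batch, seq = batch_config(position)
    print(f"{position:>16,} tokens seen: {batch // seq} sequences of {seq}")
```

Under these assumptions the first doubling keeps the number of sequences per batch fixed and doubles their length, while the second doubles the number of sequences.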

For serving, the paper uses pipeline parallelism and FP8. BF16 Llama 3 405B does not fit on one 8-H100 machine, so BF16 inference is parallelized across 16 GPUs on two machines. Within a machine, NVLink supports tensor parallelism; across machines, lower bandwidth and higher latency motivate pipeline parallelism. Microbatching improves throughput for 4,096-token prompts and 256-token outputs. FP8 uses native H100 support for most matrix multiplications, with row-wise scaling, capped dynamic scaling factors, and exclusions for the first and last Transformer layers.
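
The idea of row-wise scaling with a capped dynamic scale factor can be illustrated in a few lines. The sketch below only computes per-row scales and clamps the scaled values to the FP8 E4M3 range; it does not round to the FP8 grid and is not the paper's kernel. The function names and the cap value of 1200 are assumptions made for illustration.

```python
from typing import List

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format
SCALE_CAP = 1200.0    # assumed cap, chosen here for illustration

Matrix = List[List[float]]


def row_scales(matrix: Matrix, cap: float = SCALE_CAP) -> List[float]:
    """One dynamic scale per row, mapping the row max to the FP8 max.

    The cap keeps rows with very small values from getting a huge scale.
    """
    scales = []
    for row in matrix:
        amax = max(abs(x) for x in row)
        scales.append(cap if amax == 0 else min(FP8_E4M3_MAX / amax, cap))
    return scales


def scale_rows(matrix: Matrix, scales: List[float]) -> Matrix:
    """Apply the scales and clamp to the FP8 range (grid rounding omitted)."""
    return [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * s)) for x in row]
        for row, s in zip(matrix, scales)
    ]


activations = [[1.0, -2.0], [0.001, 0.0005]]
scales = row_scales(activations)
print(scales)
print(scale_rows(activations, scales))
```

In this example the second row has tiny values, so its uncapped scale would be 448,000; the cap limits it to 1200. Dequantization would divide each row by its scale.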

Evidence

Table 4 reports the central training-efficiency evidence. At 8,192 sequence length on 8,192 GPUs, the 405B setup uses TP=8, CP=1, PP=16, DP=64, 16M tokens per batch, 430 TFLOP/s per GPU, and 43% BF16 MFU. On 16,384 GPUs at the same sequence length, it keeps TP, CP, and PP unchanged and uses DP=128, reaching 400 TFLOP/s per GPU and 41% MFU. For the long-context stage at 131,072 sequence length, also on 16,384 GPUs, it uses TP=8, CP=16, PP=16, DP=8, 16M tokens per batch, 380 TFLOP/s per GPU, and 38% MFU. The small MFU drop when moving from 8K to 16K GPUs is attributed to needing a lower batch size per DP group to keep global tokens per batch constant.
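
The Table 4 configurations can be sanity-checked with a few lines of arithmetic. The sketch below multiplies the parallelism degrees to get the GPU count, derives the number of sequences per DP group, and estimates MFU. It assumes 16M means 16 * 2**20 tokens and a dense BF16 peak of 989 TFLOP/s per H100, a figure not stated in the text above; the reported TFLOP/s values are rounded, so the MFU estimate only approximately matches the reported percentages.

```python
from typing import NamedTuple, Tuple

MI = 2**20
TOKENS_PER_BATCH = 16 * MI       # assumption: "16M" means 16 * 2**20
H100_BF16_PEAK_TFLOPS = 989.0    # assumption: dense BF16 peak per H100


class Config(NamedTuple):
    tp: int
    cp: int
    pp: int
    dp: int
    seq_len: int
    tflops_per_gpu: float
    reported_mfu_pct: int


CONFIGS = [
    Config(8, 1, 16, 64, 8_192, 430, 43),
    Config(8, 1, 16, 128, 8_192, 400, 41),
    Config(8, 16, 16, 8, 131_072, 380, 38),
]


def summarize(cfg: Config) -> Tuple[int, int, float]:
    """Return (gpu_count, sequences_per_dp_group, estimated_mfu_pct)."""
    gpus = cfg.tp * cfg.cp * cfg.pp * cfg.dp
    seqs_per_dp = TOKENS_PER_BATCH // (cfg.dp * cfg.seq_len)
    mfu_pct = 100 * cfg.tflops_per_gpu / H100_BF16_PEAK_TFLOPS
    return gpus, seqs_per_dp, mfu_pct


for cfg in CONFIGS:
    gpus, seqs, mfu = summarize(cfg)
    print(f"{gpus:>6} GPUs, {seqs:>2} seqs per DP group, "
          f"MFU ~{mfu:.1f}% (reported {cfg.reported_mfu_pct}%)")
```

Doubling DP from 64 to 128 at a fixed global batch halves the sequences per DP group, which is the effect the paper cites for the small MFU drop.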

Operational evidence is equally important. The paper reports more than 90% effective training time despite daily maintenance interruptions. Unexpected interruption categories include faulty GPU, GPU HBM3 memory, host maintenance, GPU SRAM, GPU system processor, and silent data corruption. The authors build tools to reduce startup/checkpointing time, identify stragglers, and handle NVLink/RoCE failure modes.

The inference section gives serving evidence. BF16 405B inference requires 16 H100s across two machines. Microbatching improves throughput at the same local batch size for both prefill and decoding, trading some added latency for better throughput-latency balance. FP8 inference has negligible distributional impact in a 100,000-response reward-score comparison and gives up to 50% prefill throughput improvement over the two-machine BF16 baseline, with better decoding throughput-latency tradeoffs.

Historical Effect

Llama 3 is a historical marker for open-weight dense LLM training because it documents the infrastructure shape behind a 405B model. Earlier open-model cards often report parameter count and token count; this paper ties those to H100 fleet topology, storage, checkpointing, failure categories, MFU, parallelism dimensions, long-context training, and serving quantization. It makes clear that model quality at this scale is inseparable from cluster scheduling, network-aware parallelism, and reliability engineering.

Limits

The limits are practical and evidentiary. The run requires infrastructure far beyond ordinary reproduction: thousands of H100s, high-speed network fabric, a massive distributed file system, custom scheduling, tuned collectives, and dedicated operations. Many data and safety details are summarized rather than fully reproducible. The paper provides strong aggregate metrics, but independent verification of the full training recipe is impossible without comparable hardware and data. FP8 serving is promising but requires H100-native low-precision support and careful quality checks; naive FP8 can produce corrupted responses even when standard benchmarks look acceptable.

Links