Llama 2: Open Foundation and Fine-Tuned Chat Models

Metadata

Reading status: read complete
Year: 2023
Compute regime: Hyperscale dense LLM training (hyperscale_dense_llm_training)
PDF: 2023-llama2_2023.pdf
Extracted text: 2023-llama2_2023.txt
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly discloses hardware. Llama 2 pretraining ran on Meta's Research Super Cluster and internal production clusters, both using NVIDIA A100s. RSC used NVIDIA Quantum InfiniBand, while the production cluster used RoCE over commodity Ethernet switches; both connected 200 Gbps endpoints. The paper highlights this as an interconnect comparison and reports that RoCE scaled "almost as well" as InfiniBand up to 2000 GPUs. Power caps differed: 400 W per GPU on RSC and 350 W on the production cluster.

Training scale is given in GPU-hours by model size: 184,320 GPU-hours for 7B, 368,640 for 13B, 1,038,336 for 34B, and 1,720,320 for 70B, for a total of 3,311,616 A100-80GB GPU-hours. Carbon accounting reports 539 tCO2e for pretraining, while excluding interconnect, non-GPU server power, datacenter cooling, and hardware manufacturing. Fine-tuning, annotation, and evaluation used third-party cloud compute, but exact devices are not given.
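
As a rough cross-check on the GPU-hour figures above, the sketch below sums the per-model numbers and derives an implied per-GPU utilization. The estimate of about 6 FLOPs per parameter per token, the 312 TFLOP/s dense BF16 peak for an A100, and the nominal parameter counts are assumptions brought in for illustration, not figures from the paper. The function names are hypothetical.

```python
# GPU-hours per model size, as listed in this card.
GPU_HOURS = {"7B": 184_320, "13B": 368_640, "34B": 1_038_336, "70B": 1_720_320}
PARAMS = {"7B": 7e9, "13B": 13e9, "34B": 34e9, "70B": 70e9}  # nominal sizes (assumption)
TOKENS = 2e12              # nominal 2T pretraining tokens
A100_PEAK_FLOPS = 312e12   # assumption: dense BF16 peak of one A100


def total_gpu_hours():
    """Sum of the per-model GPU-hours."""
    return sum(GPU_HOURS.values())


def implied_utilization(name):
    """Rough model-FLOPs utilization, assuming ~6 FLOPs per parameter per token."""
    flops = 6 * PARAMS[name] * TOKENS
    gpu_seconds = GPU_HOURS[name] * 3600
    return flops / gpu_seconds / A100_PEAK_FLOPS


if __name__ == "__main__":
    print("total GPU-hours:", total_gpu_hours())
    for name in GPU_HOURS:
        print(name, f"{implied_utilization(name):.0%}")
```

Under these assumptions the implied utilization lands between roughly 35% and 45% for every model size. The reported GPU-hours may include overheads that this estimate ignores, so treat the result as an order-of-magnitude check rather than a measured efficiency.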

The model scale is dense LLM pretraining: 7B, 13B, 34B, and 70B parameter variants, all trained on 2T pretraining tokens with 4K context and a global batch size of 4M tokens. The larger 34B and 70B models use grouped-query attention to improve inference scalability.
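
The token, batch, and context figures above imply an approximate optimizer step count and number of sequences per batch. A minimal sketch of that arithmetic follows. Whether "4M" means 4,000,000 or 4 x 2^20 tokens is not stated in this card, so both readings are shown as assumptions, and the function name is hypothetical.

```python
def training_shape(total_tokens, batch_tokens, context_len):
    """Return (optimizer steps, sequences per global batch) for a fixed-batch run."""
    steps = total_tokens // batch_tokens
    seqs_per_batch = batch_tokens // context_len
    return steps, seqs_per_batch


if __name__ == "__main__":
    two_trillion = 2_000_000_000_000
    # Reading 1 (assumption): "4M" is a decimal 4,000,000 tokens.
    print(training_shape(two_trillion, 4_000_000, 4096))
    # Reading 2 (assumption): "4M" is 4 * 2**20 tokens, i.e. 1024 full 4K sequences.
    print(training_shape(two_trillion, 4 * 2**20, 4096))
```

Either way the run is on the order of half a million optimizer steps, each covering about a thousand 4K-token sequences.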

Bottleneck

Pretraining is bound by the usual dense-Transformer combination of FLOPs, memory, and distributed communication. A 4M-token global batch and 4K context require many A100s to run efficiently, and every token touches all parameters. The paper's RoCE versus InfiniBand comparison is telling: at this scale, interconnect price/performance becomes part of the training recipe.

The inference bottleneck is different. Llama 2 doubles context length from Llama 1's 2K to 4K, increasing attention KV-cache memory and per-token bandwidth. The 34B and 70B variants adopt grouped-query attention: fewer key/value heads reduce cache size and memory traffic during autoregressive decoding while preserving many query heads.

The memory bound makes that design choice concrete. A 70B model has about 1.12 TB of mixed-precision Adam training state at 16 bytes per parameter before activations. For BF16 inference, the 70B weights alone are about 140 GB. With GQA using 8 KV heads, 80 layers, head dimension 128, and a 4096-token context, batch-1 KV cache is about 1.3 GB, but batch 32 raises that cache to about 43 GB. Llama 2's serving limit is therefore dominated by weights at small batch and increasingly by KV cache as batch size and context grow.
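
The memory figures above can be reproduced with a few lines of arithmetic. The sketch below uses the layer count, KV head count, head dimension, and context length quoted in this card. The 16-bytes-per-parameter breakdown (2-byte weights and gradients plus 4-byte master weights and two 4-byte Adam moments) is a common accounting convention assumed here, and the 64-head multi-head baseline used for comparison is an assumption not stated in this card. The function names are hypothetical.

```python
def adam_training_state_bytes(n_params, bytes_per_param=16):
    """Mixed-precision Adam state, before activations (assumed 16 bytes/param)."""
    return n_params * bytes_per_param


def weight_bytes(n_params, bytes_per_value=2):
    """Weights only, at 2 bytes per value for BF16."""
    return n_params * bytes_per_value


def kv_cache_bytes(batch, context, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size; the leading 2 counts one key and one value tensor per layer."""
    return 2 * batch * context * layers * kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    n = 70_000_000_000
    gb = 1e9
    print("training state GB:", adam_training_state_bytes(n) / gb)
    print("BF16 weights GB:", weight_bytes(n) / gb)
    for batch in (1, 32):
        gqa = kv_cache_bytes(batch, 4096, 80, 8, 128)
        mha = kv_cache_bytes(batch, 4096, 80, 64, 128)  # assumed 64-head MHA baseline
        print(f"batch {batch}: GQA {gqa / gb:.1f} GB, MHA {mha / gb:.1f} GB")
```

With 8 KV heads instead of an assumed 64, the cache shrinks by a factor of 8. That factor is the difference between a batch-32 cache that fits alongside the weights on a multi-GPU node and one that would exceed the weights themselves.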

Post-training adds another bottleneck: generating, ranking, and learning from many candidate responses. Reward modeling uses binary comparisons and trains separate helpfulness and safety reward models. Rejection sampling is performed only with the largest 70B chat model; smaller models are then fine-tuned on the 70B-selected data, effectively distilling expensive inference-time search into smaller checkpoints. PPO and reward-model scoring add additional passes over generated sequences, so alignment compute is a serving-style workload as much as a training workload.
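
To make the binary-comparison objective concrete, here is a minimal sketch of a pairwise ranking loss of the form -log(sigmoid(r_chosen - r_rejected - margin)). The scalar rewards stand in for reward-model outputs. The function name is hypothetical, and the optional margin argument defaults to zero because how it would be set is an assumption here, not something this card specifies.

```python
import math


def ranking_loss(reward_chosen, reward_rejected, margin=0.0):
    """Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected - margin)).

    Computed as softplus(-x) with a branch so large |x| does not overflow exp().
    """
    x = reward_chosen - reward_rejected - margin
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    print(ranking_loss(2.0, 0.0))              # chosen scored higher: small loss
    print(ranking_loss(0.0, 2.0))              # chosen scored lower: large loss
    print(ranking_loss(2.0, 0.0, margin=1.0))  # a margin demands a wider gap
```

The loss falls as the reward model separates the preferred response from the rejected one, which is why each training row needs a forward pass over both responses in a pair.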

Method Adaptation

The base model recipe is conservative and hardware-aware: RMSNorm, SwiGLU, RoPE, AdamW with beta1 0.9 and beta2 0.95, a cosine learning-rate schedule with 2000 warmup steps, weight decay 0.1, and gradient clipping at 1.0. The compute adaptation is a larger and cleaner run: 2T tokens (40% more total tokens than Llama 1), 4K context, and 4M-token global batches. The authors state that after 2T tokens the models still showed no sign of saturation, framing the release as compute-limited rather than data-saturated.
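
The schedule described above can be sketched as linear warmup followed by cosine decay. This is a minimal illustration, not the authors' training code. The peak learning rate is left as a caller-supplied parameter, the floor at 10% of peak and the 500,000-step horizon (2T tokens divided by 4M-token batches) are assumptions, and the function name is hypothetical.

```python
import math


def lr_at(step, peak_lr, total_steps, warmup_steps=2000, min_ratio=0.1):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay.

    min_ratio is the assumed floor as a fraction of peak_lr.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


if __name__ == "__main__":
    total = 500_000  # assumption: 2T tokens / 4M tokens per step
    for step in (0, 1_999, 2_000, 250_000, 500_000):
        print(step, lr_at(step, peak_lr=1.0, total_steps=total))
```

Passing peak_lr=1.0 prints the schedule as a fraction of peak, which makes the shape easy to inspect without committing to a specific learning rate.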

GQA is the most explicit inference adaptation. Only 34B and 70B use it, exactly where KV-cache and decode bandwidth become most painful. This is a good example of changing architecture for serving economics: pretraining remains dense and straightforward, but the attention layout is adjusted to make long-context, large-model decoding more scalable.
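
The mechanism is easiest to see in a single decode step. The sketch below is a dependency-free toy of grouped-query attention, in which several query heads read from one shared key/value head. It is illustrative only: it omits RoPE, masking, and batching, and the function name and data layout are hypothetical.

```python
import math


def gqa_decode_step(q_heads, k_cache, v_cache):
    """One decode step of grouped-query attention.

    q_heads: n_q query vectors (one per query head), each of length d.
    k_cache, v_cache: n_kv heads, each a list of T cached vectors of length d.
    Returns n_q output vectors of length d.
    """
    n_q, n_kv = len(q_heads), len(k_cache)
    assert n_q % n_kv == 0, "query heads must divide evenly into KV groups"
    group = n_q // n_kv
    outputs = []
    for h, q in enumerate(q_heads):
        kv = h // group  # query heads in the same group share one KV head
        keys, values = k_cache[kv], v_cache[kv]
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out = [sum(w * v[i] for w, v in zip(weights, values)) / norm
               for i in range(len(q))]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    q = [[0.0, 0.0]] * 4                      # 4 query heads
    k = [[[1.0, 0.0], [0.0, 1.0]]] * 2        # 2 KV heads, 2 cached tokens each
    v = [[[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0], [0.0, 4.0]]]
    print(gqa_decode_step(q, k, v))
```

Only the n_kv key/value heads are stored per token, so cache size and the memory traffic of each decode step scale with n_kv rather than with the number of query heads.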

The chat pipeline adapts to alignment compute by separating stages. SFT uses a sequence length of 4096 and batch size 64. Reward models are initialized from chat checkpoints and trained with an effective batch size of 512 pairs, or 1024 rows. Rejection sampling samples multiple answers, scores them with the current reward model, and trains on the best answer; the paper studies N from 1 to 100 and notes that optimal temperature shifts during RLHF, with T around 1.2 to 1.3 best for the RLHF model when sampling 10 to 100 outputs. PPO then optimizes against the reward model. This is an explicit conversion of extra inference budget into better supervised targets and policy updates.
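
The rejection-sampling step described above amounts to best-of-N selection. The following minimal sketch shows that loop. The generate and reward callables are hypothetical stand-ins for the chat policy and the reward model, and the default N and temperature are illustrative values taken from the ranges quoted in this card, not a claim about the authors' exact settings.

```python
def best_of_n(prompt, generate, reward, n, temperature):
    """Sample n candidates, score each with the reward model, keep the best."""
    candidates = [generate(prompt, temperature) for _ in range(n)]
    scored = [(reward(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def build_sft_pairs(prompts, generate, reward, n=10, temperature=1.2):
    """Turn best-of-n winners into (prompt, answer) pairs for supervised fine-tuning."""
    return [(p, best_of_n(p, generate, reward, n, temperature)[0]) for p in prompts]


if __name__ == "__main__":
    # Toy stand-ins: candidates are integers, and the reward prefers values near 7.
    counter = iter(range(1000))
    toy_generate = lambda prompt, temperature: next(counter)
    toy_reward = lambda prompt, answer: -abs(answer - 7)
    print(best_of_n("example prompt", toy_generate, toy_reward, n=10, temperature=1.2))
```

Each kept training example costs n generations plus n reward-model forward passes, which is where the inference budget is spent.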

Evidence

The source provides both compute and benchmark evidence. Table 1 lists all Llama 2 base models as 2T-token, 4K-context runs with 4M-token global batches. Table 2 lists 3.3M A100-80GB GPU-hours across the family. The hardware section states that the RoCE production cluster was close enough to InfiniBand scaling up to 2000 GPUs to make large-scale training more economical.

On grouped academic benchmarks (Table 3), Llama 2 70B scores 37.5 on code, 71.9 on common-sense reasoning, 63.6 on world knowledge, 69.4 on reading comprehension, 35.2 on math, 68.9 on MMLU, 51.2 on BBH, and 54.2 on AGI Eval. The paper's summary says Llama 2 70B improves over Llama 1 65B by about 5 points on MMLU and 8 points on BBH and outperforms all open-source base models in that comparison. Against closed models, it is close to GPT-3.5 on MMLU and GSM8K but shows a significant gap on coding benchmarks, and it remains well behind GPT-4 and PaLM-2-L.

The alignment evidence also tracks compute. Reward model accuracy improves with more data and larger models. In rejection sampling, the gap between median and max reward grows with sample count, which is direct evidence that extra candidate generation can be turned into higher-reward training data.
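
The widening gap between median and max reward is a general property of taking the best of N draws, and a toy simulation shows the shape. The sketch below uses synthetic standard-normal rewards, not data from the paper. The distribution, the seed, the prompt count, and the function name are all assumptions for illustration.

```python
import random
import statistics


def median_and_max_reward(n_samples, n_prompts=2000, seed=0):
    """Average median and average max reward over synthetic prompts.

    Each prompt gets n_samples rewards drawn from a standard normal (toy assumption).
    """
    rng = random.Random(seed)
    medians, maxima = [], []
    for _ in range(n_prompts):
        rewards = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        medians.append(statistics.median(rewards))
        maxima.append(max(rewards))
    return statistics.fmean(medians), statistics.fmean(maxima)


if __name__ == "__main__":
    for n in (1, 10, 100):
        med, top = median_and_max_reward(n)
        print(f"N={n}: median {med:.2f}, max {top:.2f}, gap {top - med:.2f}")
```

In this toy setting the median stays near zero while the max keeps rising with N, so the selected sample moves further from a typical one as the sampling budget grows.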

Historical Effect

Llama 2 exposed an industrial-scale dense LLM recipe in unusually concrete compute terms. It showed that a well-tuned dense Transformer, trained for 2T tokens on millions of A100-hours, could define the open model frontier, and it normalized releasing both base and chat variants.

Compute-wise, it made serving constraints visible inside the architecture through GQA and made interconnect economics visible through the RoCE/InfiniBand comparison. The release helped shift open LLM work toward reproducible training recipes, long-context dense models, and alignment pipelines that use inference-time sampling and reward modeling as major compute consumers.

Limits

The hardware disclosure is good for pretraining but incomplete for the full system. Fine-tuning, annotation, evaluation, and third-party cloud compute are mentioned but not broken down by GPU type, GPU-hours, or carbon. The carbon estimate excludes non-GPU server power, interconnect power, datacenter cooling, and hardware lifecycle emissions, so the 539 tCO2e number is not a full lifecycle total.

The model itself remains English-heavy and static, with a September 2022 pretraining data cutoff. Safety coverage is incomplete; the 34B chat model was delayed because there was not enough time to red-team it. RLHF introduces its own compute-quality risks: reward model drift, distribution shift across iterative chat models, capability regressions during rejection-sampling iterations, and temperature/budget sensitivity when selecting among many generated samples.