LLaMA: Open and Efficient Foundation Language Models

Metadata

Reading status: read complete
Year: 2023
Compute regime: Hyperscale dense LLM training (hyperscale_dense_llm_training)
PDF: 2023-llama_2023.pdf
Extracted text: 2023-llama_2023.txt
PDF URL: https://arxiv.org/pdf/2302.13971.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly names the training accelerator class: A100-80GB GPUs in NVLink systems. For the largest training run, the authors say that their code processes about 380 tokens/sec/GPU when training the 65B model on 2048 A100 GPUs with 80 GB of RAM each, so the 1.4T-token run takes about 21 days. The carbon table reports GPU-hours per model: 82,432 for LLaMA-7B, 135,168 for 13B, 530,432 for 33B, and 1,022,362 for 65B. The carbon-footprint section also estimates that overall model development used 2048 A100-80GB GPUs for about five months, consuming about 2,638 MWh under the authors' assumptions.
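The 21-day figure can be cross-checked from the quoted throughput. The sketch below is back-of-the-envelope arithmetic, not the authors' accounting: the function names are illustrative, and it assumes the 380 tokens/sec/GPU rate holds for the entire run with no restarts or idle time.

```python
SECONDS_PER_DAY = 86_400

def training_days(total_tokens, tokens_per_sec_per_gpu, num_gpus):
    """Wall-clock days if every GPU sustains the quoted throughput."""
    cluster_tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return total_tokens / cluster_tokens_per_sec / SECONDS_PER_DAY

def gpu_hours(days, num_gpus):
    """GPU-hours consumed when num_gpus run for the given number of days."""
    return days * 24 * num_gpus

# Figures quoted in the card for the 65B run.
days_65b = training_days(1.4e12, 380, 2048)
hours_65b = gpu_hours(days_65b, 2048)
print(f"{days_65b:.1f} days, {hours_65b:,.0f} GPU-hours")
```

Under these assumptions the estimate comes out just under 21 days and within about 1% of the 1,022,362 GPU-hours listed for the 65B model in the carbon table, so the throughput and GPU-hour figures are mutually consistent.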

The model-size table gives the training scale. LLaMA-7B and 13B train on 1.0T tokens; LLaMA-33B and 65B train on 1.4T tokens. All use a 4M-token batch. The 65B model has 80 layers, dimension 8,192, 64 heads, and learning rate 1.5e-4. The dataset is made from publicly available sources only, with CommonCrawl as the largest component and additional C4, GitHub, Wikipedia, books, ArXiv, and StackExchange data.
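The token budgets and batch size imply a step count and a tokens-per-parameter ratio for each model. The sketch below is illustrative only: it reads "4M" as exactly 4,000,000 tokens and uses the nominal sizes in the model names (7B, 13B, 33B, 65B) as parameter counts, both of which are simplifying assumptions.

```python
BATCH_TOKENS = 4_000_000  # assumption: "4M-token batch" read as exactly 4e6

# Training tokens per model, as listed in the card.
TRAIN_TOKENS = {"7B": 1.0e12, "13B": 1.0e12, "33B": 1.4e12, "65B": 1.4e12}
# Assumption: nominal sizes from the model names stand in for exact counts.
NOMINAL_PARAMS = {"7B": 7e9, "13B": 13e9, "33B": 33e9, "65B": 65e9}

def optimizer_steps(total_tokens, batch_tokens=BATCH_TOKENS):
    """Number of batches needed to consume the token budget once."""
    return round(total_tokens / batch_tokens)

steps = {name: optimizer_steps(t) for name, t in TRAIN_TOKENS.items()}
tokens_per_param = {
    name: TRAIN_TOKENS[name] / NOMINAL_PARAMS[name] for name in TRAIN_TOKENS
}

for name in TRAIN_TOKENS:
    print(f"LLaMA-{name}: {steps[name]:,} steps, "
          f"{tokens_per_param[name]:.0f} tokens per parameter")
```

The ratio falls steeply with model size: the 7B model sees several times more tokens per parameter than the 65B model, which is the "smaller but trained longer" recipe expressed as a number.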

Bottleneck

The compute bottleneck the paper targets is explicitly the inference budget, not just training FLOPs. It argues that compute-optimal scaling laws show how to allocate a fixed training budget but ignore the cost of serving a model at scale. Given a target quality level, a smaller model trained on more tokens can be preferable because it is cheaper and faster at inference, even if it is not the cheapest model to train to that point.

This reframes dense LLM scaling around memory footprint, decode latency, and deployability. A 175B model can amortize its training cost poorly when every query must read far more parameters and be served across multiple GPUs. LLaMA's 13B result is the practical center of the paper: a model small enough to run on a single GPU for inference can match or beat much larger, older models on many tasks when it is trained on enough tokens.

The implementation bottleneck remains large-scale A100 training. Attention memory, activation storage, all-reduce communication, and optimizer state all become expensive when the batch is 4M tokens and the 65B model is sharded over 2048 GPUs. LLaMA therefore needs memory-efficient attention, selective recomputation, model/sequence parallelism, and communication overlap to make the smaller-but-longer recipe finish in a practical time.

The memory bound makes the inference argument sharper. The 65B model has a mixed-precision Adam training-state lower bound of about 1.04 TB at 16 bytes per parameter before activations, so pretraining is still a hyperscale sharding problem. For serving, LLaMA-13B needs about 26 GB for 16-bit weights (2 bytes per parameter) before KV cache and runtime workspace, which fits only the 32 GB variant of the V100. The paper's "single V100" point should therefore be read as a batch- and context-constrained inference envelope, not as large unused memory headroom. KV cache grows with context and batch, so the serving win comes from reducing the weight footprint enough that cache and workspace can still fit.
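The memory figures above follow from simple per-parameter accounting. The sketch below is a rough estimate under stated assumptions, not a measurement: the 16-byte split is the common mixed-precision Adam accounting (16-bit weights and gradients, 32-bit master weights and two Adam moments); the 13B shape (40 layers, width 5,120) comes from the paper's model-size table rather than this card; the 2,048-token context and batch size of 1 are assumed; and the KV-cache formula assumes standard multi-head attention with 16-bit cache entries.

```python
GB = 10**9

def training_state_bytes(n_params, bytes_per_param=16):
    """Weights, gradients, and optimizer state, before activations.

    Assumed split of the 16 bytes: 2 (16-bit weights) + 2 (16-bit grads)
    + 4 (fp32 master weights) + 4 + 4 (Adam first and second moments).
    """
    return n_params * bytes_per_param

def weight_bytes(n_params, bytes_per_param=2):
    """Inference weight footprint for 16-bit weights."""
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers, d_model, seq_len, batch, bytes_per_value=2):
    """One key and one value vector of width d_model per layer per token."""
    return 2 * n_layers * d_model * seq_len * batch * bytes_per_value

train_65b = training_state_bytes(65 * 10**9)
weights_13b = weight_bytes(13 * 10**9)
# Assumed 13B shape and serving scenario (see the paragraph above the code).
kv_13b = kv_cache_bytes(n_layers=40, d_model=5120, seq_len=2048, batch=1)

print(f"65B training state: {train_65b / 1e12:.2f} TB")
print(f"13B weights: {weights_13b / GB:.0f} GB")
print(f"13B KV cache, one 2048-token sequence: {kv_13b / GB:.2f} GB")
```

Because `kv_cache_bytes` is linear in both `seq_len` and `batch`, a handful of concurrent full-length sequences is enough to consume the few gigabytes left next to 26 GB of weights on a 32 GB card, which is why the single-GPU claim is best read as a constrained envelope.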

Method Adaptation

LLaMA adapts the dense Transformer recipe at two levels. At the scaling-policy level, it trains smaller models for many more tokens than older parameter-heavy recipes. The authors note that a 7B model continues improving beyond 1T tokens, even though Chinchilla-style guidance would train a 10B model on about 200B tokens for training-compute optimality. This is a deliberate trade: spend more training compute to buy lower inference cost.
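The trade between training and inference cost can be made concrete with a break-even estimate. The sketch below is a hedged illustration, not an analysis from the paper: it uses the common approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token, and it assumes, purely for illustration, that the 7B model trained on 1T tokens and the 10B model trained on 200B tokens reach the same quality, which the source does not claim.

```python
def train_flops(n_params, n_tokens):
    """Approximate training cost: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def inference_flops_per_token(n_params):
    """Approximate forward-pass cost: ~2 FLOPs per parameter per token."""
    return 2 * n_params

def break_even_tokens(small, large):
    """Served tokens after which the smaller, longer-trained model wins.

    small and large are (n_params, n_train_tokens) pairs; small must have
    fewer parameters and a higher training cost for the result to apply.
    """
    extra_training = train_flops(*small) - train_flops(*large)
    saved_per_token = (inference_flops_per_token(large[0])
                       - inference_flops_per_token(small[0]))
    return extra_training / saved_per_token

# Hypothetical comparison built from the two configurations named above.
small_long = (7e9, 1.0e12)    # 7B parameters, 1T training tokens
large_short = (10e9, 200e9)   # 10B parameters, 200B training tokens

tokens = break_even_tokens(small_long, large_short)
print(f"break-even after about {tokens:.1e} served tokens")
```

Under these assumptions the extra training compute is repaid only after trillions of served tokens, so the recipe favors models that will be deployed widely or fine-tuned and reused many times.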

At the architecture and kernel level, LLaMA uses pre-normalization with RMSNorm, SwiGLU activations, and rotary positional embeddings. These choices preserve the accelerator-friendly dense matmul structure while improving stability and quality. The optimizer is AdamW with beta values 0.9 and 0.95, cosine learning-rate decay to 10% of peak, weight decay 0.1, gradient clipping 1.0, and 2,000 warmup steps.
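The components named above are small enough to sketch in plain Python. This is a minimal single-vector illustration, not the released implementation: the `eps` value, the rotary `base` of 10000, the linear shape of the warmup, and the 350,000-step total (1.4T tokens at an assumed 4,000,000-token batch) are assumptions, and all function names are illustrative.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def rotary(x, position, base=10000.0):
    """Rotary embedding: rotate each consecutive pair by a position angle."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def learning_rate(step, peak=1.5e-4, warmup=2000, total=350_000,
                  floor_ratio=0.1):
    """Linear warmup (assumed shape), then cosine decay to 10% of peak."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * (floor_ratio + (1.0 - floor_ratio) * cosine)

for step in (0, 1_999, 176_000, 350_000):
    print(step, f"{learning_rate(step):.2e}")
```

RMSNorm and the rotary rotation are cheap elementwise operations, while the SwiGLU block consists of three dense matrix products, so the bulk of the work remains in dense matmuls.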

The efficient implementation section is directly hardware-shaped. The xFormers causal attention implementation reduces memory usage and runtime by not storing attention weights and not computing masked key/query scores. Activation checkpointing is selective: the implementation saves the activations that are expensive to recompute, such as linear-layer outputs, by implementing the Transformer backward functions manually rather than relying wholly on PyTorch autograd. To benefit fully from this, the authors reduce model memory with model and sequence parallelism and overlap activation computation with the network communication caused by all-reduce operations.
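The two attention savings described above can be illustrated without any GPU code. The sketch below is a plain-Python illustration of the idea, not the xFormers kernel: `causal_attention_streaming` scores only keys at or before each query position and folds them into a running softmax, so no row of attention weights is ever stored, while `causal_attention_naive` builds and masks the full score row for comparison. Both function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention_naive(q, k, v):
    """Reference: score every key, mask the future, softmax the full row."""
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for i, qi in enumerate(q):
        scores = [scale * dot(qi, kj) for kj in k]           # all T scores
        scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]        # stored row
        total = sum(weights)
        outputs.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                        for c in range(len(v[0]))])
    return outputs

def causal_attention_streaming(q, k, v):
    """Skip masked keys and keep only running softmax statistics."""
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for i, qi in enumerate(q):
        running_max, denom = -math.inf, 0.0
        acc = [0.0] * len(v[0])
        for j in range(i + 1):          # keys after position i never scored
            score = scale * dot(qi, k[j])
            new_max = max(running_max, score)
            correction = math.exp(running_max - new_max)  # rescale old sums
            weight = math.exp(score - new_max)
            denom = denom * correction + weight
            acc = [a * correction + weight * x for a, x in zip(acc, v[j])]
            running_max = new_max
        outputs.append([a / denom for a in acc])
    return outputs

# Tiny made-up inputs: 3 positions, key/query width 2, value width 2.
q = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
k = [[0.2, 0.1], [-0.3, 0.6], [0.7, 0.2]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(causal_attention_streaming(q, k, v))
```

The streaming version does roughly half the score computations of the naive one for long sequences and needs only a few scalars plus one accumulator per query; production kernels apply the same idea to blocks of queries and keys rather than to single positions.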

Evidence

The paper's headline evidence is that LLaMA-13B outperforms GPT-3 175B on most reported benchmarks while being more than 10x smaller, and the 65B model is competitive with Chinchilla-70B and PaLM-540B. The authors explicitly state that LLaMA-13B can run on a single V100 GPU during inference, which is the key compute-effect claim: benchmark quality moves into a much cheaper serving envelope.

Concrete benchmark tables support this. On common-sense reasoning, LLaMA-13B reports 78.1 BoolQ, 80.1 PIQA, 79.2 HellaSwag, 73.0 WinoGrande, and 52.7 ARC-Challenge, compared with GPT-3's 60.5, 81.0, 78.9, 70.2, and 51.4. LLaMA-65B reports 85.3 BoolQ and 84.2 HellaSwag, exceeding the cited Chinchilla-70B values on all listed common-sense tasks except BoolQ and exceeding PaLM-540B except on BoolQ and WinoGrande.

On closed-book QA, LLaMA-65B reaches 68.2 zero-shot and 73.0 64-shot exact match on TriviaQA, while LLaMA-13B reaches 56.6 and 64.0. On MMLU, LLaMA-13B scores 46.9 average, above GPT-3's 43.9; LLaMA-65B scores 63.4, behind Chinchilla-70B's 67.5 and PaLM-540B's 69.3. On code, LLaMA-13B reaches 15.8 pass@1 on HumanEval and 22.0 on MBPP, outperforming LaMDA-137B in that table.

Historical Effect

LLaMA made the smaller-but-more-tokens recipe the center of open dense LLM development. It connected Chinchilla-era data scaling to deployment economics: not "what is the largest model we can train," but "what model reaches a quality target at an inference budget people can actually use." That shift explains why 7B and 13B checkpoints became so historically important even though the paper also trained 33B and 65B models.

The paper also established a public-data-only baseline that was competitive with closed-data models. Combined with the release of the models to the research community, this turned A100-trained dense models into a reproducible platform for fine-tuning, quantization, retrieval augmentation, instruction tuning, and local inference work.

Limits

The hardware report is precise about A100 class, GPU-hours, and aggregate development scale, but it does not provide a full cluster topology comparable to BLOOM's Jean Zay description. It also reports a five-month development estimate rather than a clean separation of every ablation, failed run, and final training job.

The model is a base foundation model, not a complete RLHF/chat system. The paper includes a small instruction-finetuning experiment but not the later full chat alignment recipe. It also reports important weaknesses: toxicity and bias evaluations remain limited, TruthfulQA scores improve over GPT-3 but remain low in absolute terms, and MMLU lags Chinchilla and PaLM, plausibly because LLaMA used far less book and academic-paper data. Finally, the 65B model still requires hyperscale A100 training; the serving wins do not make pretraining cheap.