A Survey of Large Language Models

Metadata

Reading status: read complete
Year: 2023
Compute regime: Hyperscale dense LLM training (hyperscale_dense_llm_training)
PDF: 2023-llm_survey_2023.pdf
Extracted text: 2023-llm_survey_2023.txt
PDF URL: https://arxiv.org/pdf/2303.18223.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

This is a survey, not a primary experimental paper, so it has no single training run, hardware allocation, or inference deployment to reconstruct. The local extracted source is arXiv:2303.18223v19 dated March 18, 2026, while the reading card is assigned to 2023. For this project it should therefore be used as secondary taxonomy/context and as a pointer to primary cards, not as the sole source for a 2023 hardware claim.

The survey does, however, collect device-era evidence in its model statistics table. Examples include T5 11B trained on 1,024 TPU v3 chips, T0 11B on 512 TPU v3 for 27 hours, GPT-NeoX-20B on 96 40GB A100 GPUs, OPT 175B on 992 80GB A100 GPUs, BLOOM 176B on 384 80GB A100 GPUs for 105 days, LLaMA 65B on 2,048 80GB A100 GPUs for 21 days, StarCoder 15.5B on 512 40GB A100 GPUs, and LLaMA2 70B on 2,000 80GB A100 GPUs. For models whose original papers omit hardware, such as GPT-3 in this table, this card should preserve that omission and defer to primary evidence or the project inference rule.
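
For rows that report both a device count and a duration, a device-time product gives a rough way to compare runs. The sketch below uses only the three rows in the paragraph above that state a duration. It assumes every device ran for the whole reported period, which the survey table does not confirm, so the results are upper-bound-style estimates rather than measured utilization.

```python
# Rows copied from the card paragraph above: (model, device count, device, days).
# Only rows whose duration is stated in the card are included.
RUNS = [
    ("T0 11B", 512, "TPU v3", 27 / 24),   # 27 hours expressed in days
    ("BLOOM 176B", 384, "A100 80GB", 105),
    ("LLaMA 65B", 2048, "A100 80GB", 21),
]


def device_days(count, days):
    """Device-time product, assuming all devices ran for the full duration."""
    return count * days


for model, count, device, days in RUNS:
    print(f"{model}: {device_days(count, days):,.0f} {device} device-days")
```

Any number derived this way should still be checked against the primary card for that model.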

Bottleneck

The survey locates the compute bottleneck in the whole LLM stack rather than in any one device. It emphasizes that LLM training requires large-scale data processing and distributed parallel training, and that research and engineering are no longer cleanly separated. Model and dataset sizes push two technical problems to the front: increasing training throughput and loading larger models into GPU memory. The survey also frames scaling laws around model size, dataset size, and training compute, with Chinchilla-style analysis arguing that how compute is allocated between parameters and tokens matters, not just raw model size.
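
The parameter/token trade-off can be illustrated with a small calculation. This is a sketch under two assumptions that are not stated in this card: the common approximation that dense transformer training costs about 6 FLOPs per parameter per token, and the Chinchilla-style rule of thumb of roughly 20 training tokens per parameter. The 70B-parameter, 1.4T-token budget is used only as an illustrative input.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumption: C ~= 6 * N * D for dense transformers
TOKENS_PER_PARAM = 20       # assumption: Chinchilla-style rule of thumb


def training_flops(params, tokens):
    """Approximate training compute for `params` parameters and `tokens` tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal_split(flops):
    """Return (params, tokens) that spend `flops` with tokens = 20 * params."""
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


budget = training_flops(70e9, 1.4e12)   # illustrative budget
params, tokens = compute_optimal_split(budget)
print(f"budget={budget:.3g} FLOPs -> {params / 1e9:.1f}B params, "
      f"{tokens / 1e12:.2f}T tokens")
```

The point of the sketch is only that a fixed compute budget implies a paired choice of parameters and tokens. Both constants are approximations and vary between scaling-law studies.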

Memory is a first-order bottleneck in the survey's later practical sections. It estimates that model states during mixed-precision Adam/AdamW training require 16P bytes for a model with P parameters: FP16/BF16 parameters and gradients plus FP32 optimizer state. It gives LLaMA-7B as an example requiring about 100GB just for model states, before activations. It also estimates that LLaMA-7B activations can take 16GB per device for batch size 1 and sequence length 2,048. These numbers explain why even "small" multi-billion-parameter models already exceed the memory of a single commodity GPU when trained with standard optimizers.
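
The 16P figure can be checked with a few lines of arithmetic. The sketch below splits the 16 bytes per parameter into 2 (low-precision parameters), 2 (low-precision gradients), and 12 (FP32 master weights plus the two Adam moments, 4 bytes each). The exact parameter count for LLaMA-7B is an assumption here (about 6.7 billion).

```python
BYTES_PARAM_16 = 2    # FP16/BF16 copy of each parameter
BYTES_GRAD_16 = 2     # FP16/BF16 gradient for each parameter
BYTES_OPTIM_32 = 12   # FP32 master weight + Adam momentum + Adam variance

LLAMA_7B_PARAMS = 6_700_000_000   # assumption: approximate parameter count


def model_state_bytes(num_params):
    """Bytes of model state for mixed-precision Adam/AdamW, before activations."""
    return num_params * (BYTES_PARAM_16 + BYTES_GRAD_16 + BYTES_OPTIM_32)


total = model_state_bytes(LLAMA_7B_PARAMS)
print(f"{total / 1e9:.1f} GB ({total / 2**30:.1f} GiB) of model state")
```

Under these assumptions the result lands close to the survey's "about 100GB", and activations, temporary buffers, and fragmentation come on top of it.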

Method Adaptation

The survey organizes the methods that adapt LLM work to hardware constraints. For training, it describes 3D parallelism as a combination of data parallelism, pipeline parallelism, and tensor parallelism. Data parallelism increases throughput by splitting batches but replicates model state; pipeline parallelism splits layers across GPUs but creates bubble overhead; tensor parallelism splits parameter matrices and performs parts of a matrix multiplication on different devices. The survey notes that these methods are often combined, citing BLOOM's use of 8-way data parallelism, 4-way tensor parallelism, and 12-way pipeline parallelism on 384 A100 GPUs.
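
The BLOOM configuration above can be sanity-checked by laying the three parallel dimensions out as a device mesh. The sketch below is illustrative only: the rank ordering (tensor-parallel index varying fastest) and the helper names are assumptions, not the layout any particular framework or the BLOOM run is confirmed to use.

```python
from itertools import product


def build_mesh(dp, pp, tp):
    """Map each global rank to (data, pipeline, tensor) coordinates.

    Assumption: the tensor-parallel index varies fastest, so ranks in one
    tensor-parallel group are adjacent.
    """
    coords = product(range(dp), range(pp), range(tp))
    return {rank: coord for rank, coord in enumerate(coords)}


def tensor_group(mesh, rank):
    """Ranks that share this rank's data and pipeline coordinates."""
    d, p, _ = mesh[rank]
    return [r for r, (dd, pp_, _) in mesh.items() if (dd, pp_) == (d, p)]


mesh = build_mesh(dp=8, pp=12, tp=4)   # degrees quoted in the card for BLOOM
print(len(mesh), "devices;", 12 * 4, "devices per model replica")
print("rank 5 ->", mesh[5], "tensor group:", tensor_group(mesh, 5))
```

The product 8 x 12 x 4 = 384 matches the GPU count in the hardware table. Each of the 8 data-parallel replicas spans 48 devices, which is why data parallelism alone replicates model state while the other two dimensions divide it.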

Memory-saving techniques are presented as necessary complements. Mixed precision reduces memory and communication. FP16 runs faster than FP32 on GPUs such as A100 but can hurt numerical accuracy, while BF16 gives up mantissa bits for a wider exponent range and is generally better for pretraining representation accuracy. ZeRO and FSDP reduce redundancy by partitioning optimizer states, gradients, and parameters across GPUs. Activation recomputation/checkpointing trades extra compute for lower activation memory. The survey also points to FlashAttention and sequence parallelism as ways to reduce attention IO and distribute long-sequence work.
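
How much partitioning helps can be estimated by applying it step by step to the 16-bytes-per-parameter model-state budget. The sketch below follows the usual ZeRO stage numbering (stage 1 shards optimizer state, stage 2 also shards gradients, stage 3 also shards parameters). The function name is hypothetical, and the estimate ignores activations, communication buffers, and fragmentation.

```python
def zero_bytes_per_device(num_params, num_devices, stage):
    """Approximate model-state bytes per device under ZeRO-style sharding.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer state sharded
    stage 2: optimizer state and gradients sharded
    stage 3: optimizer state, gradients, and parameters sharded
    """
    params = 2 * num_params    # FP16/BF16 parameters
    grads = 2 * num_params     # FP16/BF16 gradients
    optim = 12 * num_params    # FP32 master weights + Adam moments
    if stage >= 1:
        optim /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        params /= num_devices
    return params + grads + optim


P, N = 6_700_000_000, 8   # assumption: ~7B-parameter model on 8 devices
for stage in range(4):
    gb = zero_bytes_per_device(P, N, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per device")
```

Fully sharded FSDP corresponds roughly to the stage 3 case. The lower stages trade less communication for less memory saving.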

For inference, the survey separates prefill and decoding. Prefill processes the prompt and builds the KV cache; decoding generates new tokens autoregressively and repeatedly reads cached keys and values. Its arithmetic-intensity analysis says prefill is compute-bound on an A100-like GPU, while decoding is constrained by memory transfer speed, the "memory wall." That motivates Flash-Decoding, PagedAttention, continuous batching in vLLM, quantization, and speculative/cascade-style decoding.
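
The prefill/decoding contrast can be reproduced with a simplified roofline estimate for a single linear layer. This is a sketch, not the survey's own derivation. The peak throughput and bandwidth figures are assumed values for an A100-80GB-class device, the hidden size of 4096 is a hypothetical example, and attention and KV-cache traffic are ignored.

```python
PEAK_FLOPS = 312e12        # assumption: dense BF16 peak, A100-class GPU
PEAK_BANDWIDTH = 2.0e12    # assumption: ~2 TB/s HBM bandwidth
BYTES_PER_VALUE = 2        # FP16/BF16 weights and activations

RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH   # FLOPs per byte where the two limits meet


def linear_intensity(tokens, d_in, d_out):
    """FLOPs per byte moved for y = x @ W with x of shape (tokens, d_in)."""
    flops = 2 * tokens * d_in * d_out
    values_moved = d_in * d_out + tokens * d_in + tokens * d_out
    return flops / (BYTES_PER_VALUE * values_moved)


def bound(tokens, d_model):
    """Classify a square linear layer as compute- or memory-bound."""
    return "compute" if linear_intensity(tokens, d_model, d_model) > RIDGE else "memory"


for name, tokens in [("prefill, 2,048-token prompt", 2048), ("decoding, 1 token", 1)]:
    ai = linear_intensity(tokens, 4096, 4096)
    print(f"{name}: {ai:.1f} FLOPs/byte -> {bound(tokens, 4096)}-bound")
```

With one token per step, the weights are read once per token and the intensity sits far below the ridge point. This is the memory-wall argument in miniature, and it is why batching more sequences per decoding step raises utilization.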

Evidence

The strongest evidence in this card is comparative and secondary. The hardware table shows the 2022-2023 transition to A100-80GB clusters as a default open-model training substrate: OPT at 992 A100-80GB, BLOOM at 384 A100-80GB, LLaMA at 2,048 A100-80GB, StarCoder at 512 A100-40GB, and LLaMA2 at 2,000 A100-80GB. The same table records non-NVIDIA systems such as TPU v3/v4 and Ascend 910, making it useful for device-era context. The survey's training section names frameworks (DeepSpeed, Megatron-LM, Colossal-AI, Alpa) and techniques (ZeRO, FSDP, gradient checkpointing, BF16) as the practical toolkit for making those clusters usable.

The evidence should not be overread. Because this is v19 from 2026 and because many rows summarize other papers, any precise claim about a model's actual run should be verified against that model's primary card. The survey is best used to connect isolated cards into a compute map: scaling laws define why tokens and parameters trade off, distributed parallelism explains how models fit, memory analysis explains why optimizer state dominates training, and inference analysis explains why serving is often bandwidth-limited after the prompt prefill.

Historical Effect

Historically, this survey is a consolidation point. It codifies LLM development as a set of linked compute problems: data construction, scale prediction, distributed training, adaptation tuning, tool use, retrieval, evaluation, and deployment. For this history project, its value is not a single result but a checklist: primary-paper cards should cover scaling laws, dense pretraining, parallelism, memory optimization, RLHF, instruction tuning, tool/RAG inference, quantization, and serving.

Limits

The limits are substantial. The local source is a later 2026 version, so it includes later work and terminology that should not be projected backward into early 2023 without care. As a survey, it sometimes compresses hardware rows and omits caveats that primary papers discuss in detail. It cannot substitute for direct evidence on exact devices, batch sizes, interconnects, job reliability, or wall-clock time. Use it to identify the compute structure and then attach precise historical claims to primary sources such as GPT-NeoX, OPT, BLOOM, LLaMA, Llama 3, FlashAttention, or ZeRO.