GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Metadata

Compute Setup

The paper explicitly reports the cluster. GPT-NeoX-20B was trained on twelve Supermicro AS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and two AMD EPYC 7532 CPUs, for 96 A100 GPUs total. Each GPU can access the InfiniBand fabric through one of four ConnectX-6 HCAs using GPUDirect RDMA. Two NVIDIA MQM8700-HS2R switches form the InfiniBand spine. The software stack is PyTorch 1.10.0 built with CUDA 11.1 and NCCL 2.10.3, plus a codebase built on Megatron and DeepSpeed.

The model has 20B parameters, 44 layers, hidden size 6,144, and 64 attention heads. Training uses contexts of 2,048 tokens and an approximate 3.15M-token batch, described as 1,538 contexts per step, for 150,000 steps with cosine learning-rate decay. The model is trained on the Pile, which the paper describes as more than 825 GiB of raw text from 22 data sources.
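The batch and token-budget figures above follow from simple arithmetic. The sketch below only multiplies the numbers quoted in this section; the variable names are illustrative and do not come from the GPT-NeoX codebase.

```python
# Token budget implied by the reported training configuration.
# All three constants are quoted in the text above; names are illustrative.
CONTEXT_LENGTH = 2048        # tokens per context
CONTEXTS_PER_STEP = 1538     # contexts per optimizer step
TRAINING_STEPS = 150_000     # total optimizer steps

tokens_per_step = CONTEXT_LENGTH * CONTEXTS_PER_STEP
total_tokens = tokens_per_step * TRAINING_STEPS

print(f"tokens per step: {tokens_per_step:,}")   # about 3.15M
print(f"total tokens:    {total_tokens:,}")      # about 472B
```

The product of 1,538 contexts and 2,048 tokens is where the "approximately 3.15M-token batch" comes from, and 150,000 such steps give a budget of roughly 472B training tokens.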

Bottleneck

The bottleneck is fitting a dense 20B model on a relatively small open cluster with 40 GB A100s. The paper states directly that weights and optimizer states at this scale do not fit on a single GPU, so memory placement determines the parallelism plan. Compared with 80 GB A100 hyperscale runs, GPT-NeoX has less per-device memory and fewer total GPUs, so it must keep the most communication-heavy operations inside a node where NVLink/NVSwitch bandwidth is available.

The bound is concrete. With mixed-precision Adam, a 20B model has a rough training-state lower bound of 20B x 16 bytes, or about 320 GB, before activations and temporary buffers. A BF16/FP16 inference copy is already about 40 GB of weights; at 44 layers, 64 heads, head dimension 96, and context 2048, the batch-1 KV cache adds about 2.2 GB. Inference on a single 40 GB GPU is therefore unrealistic without lower precision or offload, which explains why the paper frames inference as economical on a larger-memory A6000 or a pair of consumer GPUs while finetuning remains a cluster task.
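The memory estimates above can be reproduced with a few lines of arithmetic. This is a rough sketch: the 16 bytes per parameter is taken from the text, and the breakdown of those bytes in the comment is the usual mixed-precision Adam accounting, stated here as an assumption rather than something the paper itemizes. The round 20B parameter count and 2-byte KV entries are likewise simplifications.

```python
import math

# Rough memory accounting for a 20B dense model (simplified, round numbers).
PARAMS = 20_000_000_000
GPU_MEMORY_BYTES = 40 * 10**9          # one A100-40GB, decimal GB

# Assumed breakdown of 16 bytes/param for mixed-precision Adam:
# 2 (fp16 weights) + 2 (fp16 grads) + 4 (fp32 master) + 4 + 4 (Adam moments).
TRAINING_BYTES_PER_PARAM = 16
INFERENCE_BYTES_PER_PARAM = 2          # BF16/FP16 weights only

LAYERS, HEADS, HEAD_DIM, CONTEXT = 44, 64, 96, 2048
KV_BYTES_PER_VALUE = 2                 # assumes a 16-bit KV cache

training_state_bytes = PARAMS * TRAINING_BYTES_PER_PARAM
inference_weight_bytes = PARAMS * INFERENCE_BYTES_PER_PARAM
# Keys and values (factor 2) for every layer, head, and position at batch 1.
kv_cache_bytes = 2 * LAYERS * HEADS * HEAD_DIM * CONTEXT * KV_BYTES_PER_VALUE

# Lower bound on GPUs needed just to hold training state (no activations).
min_gpus_for_state = math.ceil(training_state_bytes / GPU_MEMORY_BYTES)

print(f"training state:    {training_state_bytes / 1e9:.0f} GB")
print(f"inference weights: {inference_weight_bytes / 1e9:.0f} GB")
print(f"KV cache, batch 1: {kv_cache_bytes / 1e9:.2f} GB")
print(f"min 40 GB GPUs for training state: {min_gpus_for_state}")
```

Under these assumptions the training state alone needs at least eight 40 GB GPUs before any activations are stored, which is consistent with the text's point that memory placement drives the parallelism plan.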

Communication is the second bottleneck. Tensor parallelism creates frequent collectives inside layers; pipeline parallelism moves activations between stage boundaries; data parallelism synchronizes replicas. The paper's chosen topology is therefore not arbitrary: tensor parallel size 2 and pipeline parallel size 4 let those expensive model-parallel communications occur within an eight-GPU node, while data-parallel communication crosses node boundaries over InfiniBand.
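The topology argument can be checked by laying out the ranks. The sketch below uses a hypothetical rank ordering (tensor-parallel index fastest, then pipeline stage, then data-parallel replica) chosen only to illustrate the paper's claim; it is not the ordering used by Megatron, DeepSpeed, or the GPT-NeoX codebase, and `placement` is an illustrative helper.

```python
# Illustrative 3D-parallel layout for 12 nodes x 8 GPUs with TP=2, PP=4.
# The rank ordering is an assumption made for this sketch only.
NODES = 12
GPUS_PER_NODE = 8
TENSOR_PARALLEL = 2
PIPELINE_PARALLEL = 4

world_size = NODES * GPUS_PER_NODE
model_parallel_size = TENSOR_PARALLEL * PIPELINE_PARALLEL
data_parallel = world_size // model_parallel_size


def placement(rank):
    """Map a global rank to its parallel coordinates and physical node."""
    return {
        "tp_rank": rank % TENSOR_PARALLEL,
        "pp_rank": (rank // TENSOR_PARALLEL) % PIPELINE_PARALLEL,
        "dp_rank": rank // model_parallel_size,
        "node": rank // GPUS_PER_NODE,
    }


# One model replica (TP x PP = 8 GPUs) fits exactly in one 8-GPU node, so
# every replica's tensor and pipeline traffic stays inside that node.
replica_nodes = {}
for rank in range(world_size):
    p = placement(rank)
    replica_nodes.setdefault(p["dp_rank"], set()).add(p["node"])

print(f"data-parallel replicas: {data_parallel}")
print(f"max nodes spanned by one replica: "
      f"{max(len(n) for n in replica_nodes.values())}")
```

Because 2 x 4 equals the eight GPUs in a node, the 96 GPUs form 12 replicas, one per node, and only data-parallel gradient synchronization has to cross InfiniBand.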

A third bottleneck is token efficiency. With training steps, context length, and batch size fixed, the total token budget is fixed too, so a tokenizer that represents the same corpus in fewer tokens lets the model see more text for the same compute. The paper treats tokenizer design, especially whitespace handling for code and LaTeX-heavy data, as a compute-relevant choice.

Method Adaptation

GPT-NeoX adapts Megatron/DeepSpeed training to its 12-node hardware. AdamW is extended with ZeRO so optimizer states are distributed across ranks. Tensor parallelism from Megatron splits large layer computations, while pipeline parallelism splits the model across GPUs. The authors report that tensor parallel size 2 and pipeline parallel size 4 are most efficient for their hardware, because they keep the heaviest model-parallel traffic within a node and reserve cross-node communication mostly for data parallelism.

The architecture includes a specific communication optimization: attention and feed-forward layers are computed in parallel and then summed, rather than run serially. The paper explains that with operator sharding, each residual addition can require all-reduce work in the forward and backward pass; doing attention and feed-forward in parallel allows local reduction before a single all-reduce. The authors cite a 15% throughput increase for this pattern in Mesh Transformer JAX, with comparable early loss curves.
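In symbols, the serial block computes y = x + FF(LN2(x + Attn(LN1(x)))), while the parallel block computes y = x + Attn(LN1(x)) + FF(LN2(x)). A toy sketch of why the parallel form needs only one collective: both branches read the same input, so each tensor-parallel rank can add its partial attention and feed-forward outputs locally before reducing. The `all_reduce` function and the integer "partial outputs" below are stand-ins invented for illustration; no real collective library is involved.

```python
# Toy model of tensor-parallel partial sums; values are made up.

def all_reduce(partials):
    """Elementwise sum across ranks, standing in for a real all-reduce."""
    return [sum(values) for values in zip(*partials)]


x = [100, 200, 300]                          # residual stream input
attn_partials = [[1, 2, 3], [10, 20, 30]]    # per-rank partial Attn outputs
ff_partials = [[4, 5, 6], [40, 50, 60]]      # per-rank partial FF outputs

# Reducing each branch separately costs two collectives.
attn_out = all_reduce(attn_partials)
ff_out = all_reduce(ff_partials)
y_two_collectives = [xi + a + f for xi, a, f in zip(x, attn_out, ff_out)]

# Parallel formulation: sum the two partials on each rank, then reduce once.
local_sums = [
    [a + f for a, f in zip(attn_rank, ff_rank)]
    for attn_rank, ff_rank in zip(attn_partials, ff_partials)
]
y_one_collective = [xi + s for xi, s in zip(x, all_reduce(local_sums))]

print(y_two_collectives)
print(y_one_collective)
```

Both paths give the same result because addition commutes with the reduction. In the serial form this shortcut is unavailable, since the feed-forward input depends on the fully reduced attention output.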

Other design choices reduce complexity or improve efficiency. GPT-NeoX uses rotary positional embeddings on the first 25% of embedding dimensions, all dense layers rather than GPT-3-style alternating dense/sparse layers, and an initialization designed to keep activations from growing with depth and width. The tokenizer is trained on the Pile, keeps GPT-2's vocabulary size of 50,257, handles prefix spaces consistently, and includes repeated-space tokens up to length 24 to compress whitespace-heavy documents.
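To make the partial rotary embedding concrete: with head dimension 96, 25% corresponds to 24 rotated dimensions per head, and the remaining 72 pass through unchanged. The sketch below is a minimal illustration, not the GPT-NeoX implementation. Applying the fraction per head, pairing adjacent dimensions, and the base of 10,000 are common RoPE conventions assumed here, and `partial_rotary` is a hypothetical helper.

```python
import math

HEAD_DIM = 96
ROTARY_FRACTION = 0.25
ROTARY_DIMS = int(HEAD_DIM * ROTARY_FRACTION)   # 24 rotated dimensions


def partial_rotary(vec, position, rotary_dims=ROTARY_DIMS, base=10000.0):
    """Rotate the first `rotary_dims` entries pairwise; leave the rest as-is.

    Assumes adjacent-pair rotation and the usual base**(-i/d) frequencies.
    """
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


query = [float(i + 1) for i in range(HEAD_DIM)]  # toy per-head query vector
rotated = partial_rotary(query, position=5)

print(f"rotated dims: {ROTARY_DIMS}, untouched dims: {HEAD_DIM - ROTARY_DIMS}")
print(f"tail unchanged: {rotated[ROTARY_DIMS:] == query[ROTARY_DIMS:]}")
```

Each pair is rotated by a position-dependent angle, so the rotated slice keeps its norm while the unrotated slice carries no positional signal.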

Evidence

The training stack achieves and maintains 117 TFLOP/s per GPU. This is the key systems evidence: a public Megatron/DeepSpeed stack on 96 A100-40GB GPUs can sustain high utilization while training a 20B dense model.
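To put 117 TFLOP/s in context, it can be compared with the accelerator's peak. The sketch below assumes NVIDIA's published dense FP16/BF16 tensor-core peak of 312 TFLOP/s for the A100; that figure comes from the vendor specification, not from the paper, and the variable names are illustrative.

```python
# Utilization implied by the reported per-GPU throughput.
ACHIEVED_TFLOPS_PER_GPU = 117   # reported in the paper
A100_PEAK_TFLOPS = 312          # assumed: vendor dense FP16/BF16 peak
NUM_GPUS = 96

utilization = ACHIEVED_TFLOPS_PER_GPU / A100_PEAK_TFLOPS
aggregate_tflops = ACHIEVED_TFLOPS_PER_GPU * NUM_GPUS

print(f"utilization of peak: {utilization:.1%}")
print(f"aggregate throughput: {aggregate_tflops / 1000:.2f} PFLOP/s")
```

Under that assumption the run sustains about 37.5% of peak per GPU, or roughly 11.2 PFLOP/s across the cluster.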

The tokenizer evidence is quantified. On the Pile validation set, the GPT-2 tokenizer produces 383,111,734 tokens, while the GPT-NeoX tokenizer produces 342,887,807, a ratio of about 0.895. The paper summarizes this as approximately 10% fewer tokens overall, with more than 20% savings on EuroParl, GitHub, and PubMed Central components. Even excluding whitespace tokens, the GPT-NeoX tokenizer still uses about 5% fewer tokens on the Pile validation set. A code example in the paper drops from 55 GPT-2 tokens to 39 GPT-NeoX tokens.
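The tokenizer ratios can be recomputed directly from the counts quoted above. The last line is a derived quantity, not a figure from the paper: if the same text needs about 0.895 times as many tokens, a fixed token budget covers about 1/0.895 times as much text.

```python
# Token counts on the Pile validation set, as quoted in the text.
GPT2_TOKENS = 383_111_734
NEOX_TOKENS = 342_887_807

# Token counts for the paper's code example.
GPT2_CODE_TOKENS = 55
NEOX_CODE_TOKENS = 39

ratio = NEOX_TOKENS / GPT2_TOKENS
code_ratio = NEOX_CODE_TOKENS / GPT2_CODE_TOKENS
text_per_budget = 1 / ratio   # derived: relative text covered per token

print(f"Pile validation ratio: {ratio:.3f} ({1 - ratio:.1%} fewer tokens)")
print(f"code example ratio:    {code_ratio:.3f} ({1 - code_ratio:.1%} fewer)")
print(f"text per fixed token budget: {text_per_budget:.3f}x")
```

The overall saving is about 10.5%, while the code example saves about 29%, which fits the text's point that whitespace-heavy data benefits most.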

Evaluation evidence shows a useful public model, not just a training exercise. In zero-shot tables, GPT-NeoX-20B improves over GPT-J-6B on tasks such as LAMBADA (0.720 versus 0.683), PIQA (0.779 versus 0.752), TriviaQA (0.259 versus 0.170), ARC-Easy (0.723 versus 0.670), and ARC-Challenge (0.380 versus 0.340). Against FairSeq models, the authors say GPT-NeoX-20B outperforms on 22 of 32 evaluations, underperforms on four, and is within error on six. Moving from zero-shot to five-shot, GPT-NeoX-20B gains 0.0598, more than three times the FairSeq 13B model's gain of 0.0183.

Historical Effect

GPT-NeoX made a 20B dense autoregressive model and its training stack a public infrastructure artifact. Historically, it sits between GPT-J-scale open models and 100B+ open-science efforts such as BLOOM and OPT: large enough to require serious model parallelism, but small enough to train on a 12-node A100 cluster.

The paper's lasting compute contribution is its operational recipe: Megatron/DeepSpeed, ZeRO optimizer memory reduction, TP/PP/DP placement matched to node topology, a tokenizer optimized for the actual corpus, and release of weights, code, and intermediate checkpoints. It gave researchers outside frontier labs a concrete path for dense LLM training and evaluation.

Limits

The model is still far below 175B-class frontier scale, and the paper acknowledges that inference is not universally accessible: it says inference is most economical on a pair of RTX 3090 Ti GPUs or a single A6000, while finetuning requires significantly more compute. Training access remains a cluster problem even when weights are public.

There are also methodological limits. Hyperparameters are interpolated from GPT-3 and smaller experiments because full sweeps are infeasible at 20B. The Pile is used as-is without large-scale deduplication, and the authors leave deduplication for future work. Many design choices target coding performance, but the paper says they underestimated the difficulty and cost of coding evaluations and did not evaluate that domain. Finally, the architecture contains a disclosed oversight: two independent LayerNorms were applied where a tied layer norm was intended, noticed too late to restart.

Links

  • Compute regime: history/compute_regimes/hyperscale_dense_llm_training/README.md
  • Source PDF and extracted text are listed in metadata above.
  • Queue status: read_complete.