Textbooks Are All You Need

Metadata

Reading status: read complete
Year: 2023
Compute regime: Hyperscale dense LLM training (hyperscale_dense_llm_training)
PDF: 2023-phi_2023.pdf
Extracted text: 2023-phi_2023.txt
PDF URL: https://arxiv.org/pdf/2306.11644.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper reports the hardware directly: phi-1 is a 1.3B-parameter Transformer trained for 4 days on 8 NVIDIA A100 GPUs. In the methods section, the authors restate the setup as 8 Nvidia-A100 GPUs using DeepSpeed, with pretraining of phi-1-base completed in under 4 days and finetuning to phi-1 requiring an additional 7 hours on the same hardware. The training stack uses fp16, AdamW, sequence length 2048, and FlashAttention for multi-head attention.

The model is deliberately modest for 2023. The 1.3B architecture has 24 layers, hidden dimension 2048, MLP inner dimension 8192, and 32 attention heads of dimension 64. Pretraining uses effective batch size 1024 and 36,000 planned steps; the released phi-1-base checkpoint is taken at 24,000 steps, about 8 epochs over CodeTextbook and a little over 50B total training tokens. Figure 2.1 gives a concrete compute accounting: phi-1-base after 51B tokens used 770 GPU-hours, while a Stack+ comparison trained for 76B tokens used 1090 GPU-hours.
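
The figures in this section can be cross-checked with a little arithmetic. The sketch below is a back-of-envelope estimate, not the authors' accounting. It ignores biases and layer norms, counts a single embedding matrix, assumes the effective batch size is measured in sequences of 2048 tokens, and uses a vocabulary size of 51,200 that is an assumption (it is not stated in these notes).

```python
# Architecture figures quoted in the notes above.
N_LAYERS = 24
D_MODEL = 2048
D_MLP = 8192
N_HEADS = 32
HEAD_DIM = 64

# ASSUMPTION: vocabulary size is not given in these notes; 51,200 is illustrative.
VOCAB_SIZE = 51_200


def block_params(d_model: int, d_mlp: int) -> int:
    """Weight count of one Transformer block, ignoring biases and layer norms."""
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    mlp = 2 * d_model * d_mlp          # up- and down-projection
    return attention + mlp


def tokens_seen(steps: int, batch_sequences: int = 1024, seq_len: int = 2048) -> int:
    """Tokens processed, assuming the batch size counts full-length sequences."""
    return steps * batch_sequences * seq_len


# The head layout should tile the hidden dimension exactly.
assert N_HEADS * HEAD_DIM == D_MODEL

non_embedding_params = N_LAYERS * block_params(D_MODEL, D_MLP)
embedding_params = VOCAB_SIZE * D_MODEL  # one matrix only (assumption)
total_params = non_embedding_params + embedding_params

tokens_at_checkpoint = tokens_seen(24_000)  # released phi-1-base checkpoint
tokens_planned = tokens_seen(36_000)        # full planned schedule

gpu_hours_upper_bound = 8 * 4 * 24          # 8 GPUs for 4 full days

print(f"estimated parameters: {total_params / 1e9:.2f}B")
print(f"tokens at 24k steps:  {tokens_at_checkpoint / 1e9:.1f}B")
print(f"tokens at 36k steps:  {tokens_planned / 1e9:.1f}B")
print(f"8 GPUs x 4 days:      {gpu_hours_upper_bound} GPU-hours")
```

Under these assumptions the estimate lands near the quoted 1.3B parameters, the 24,000-step checkpoint lands near the quoted "a little over 50B" tokens, and 8 GPUs for 4 days lands near the 770 GPU-hours reported for phi-1-base.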

Bottleneck

The bottleneck is waste in the token stream. Most code LLM scaling work in 2023 spent far more accelerator time on much larger corpora and models. phi's claim is that code-generation performance is often limited by the density and pedagogical quality of examples, not just by parameter count or raw token count. If a small A100 run consumes redundant, noisy, poorly documented, or unbalanced repository text, GPU-hours are spent learning weak mappings from natural language to code.

The paper belongs in a compute-focused reading because it shifts the source of capability from the model/data scale axis to the data-quality axis. Eight A100s are enough for a 1.3B model with FlashAttention and fp16, but not enough to brute-force trillion-token frontier training. The method therefore has to make each token do more work.

Method Adaptation

phi-1 adapts to a small A100 budget by reducing both model size and corpus size while increasing data selectivity. The pretraining corpus, CodeTextbook, combines about 6B filtered code-language tokens from The Stack and StackOverflow with less than 1B synthetic GPT-3.5-generated Python textbook tokens. A separate CodeExercises finetuning set contains about 180M synthetic exercise-and-solution tokens. The two pretraining components together contain less than 7B tokens, but the model makes about 8 passes over them.
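
As a rough consistency check on the data budget, the sketch below works backward from the token counts quoted in these notes. The inputs are rounded bounds from the text (about 6B, under 1B, about 180M, a little over 50B), so the outputs are approximate and only indicate orders of magnitude.

```python
# Rounded figures from the notes; each is a bound or approximation, not an exact count.
FILTERED_TOKENS = 6_000_000_000            # "about 6B" filtered Stack + StackOverflow
SYNTHETIC_TEXTBOOK_MAX = 1_000_000_000     # "less than 1B" synthetic textbook tokens
EXERCISE_TOKENS = 180_000_000              # "about 180M" CodeExercises tokens
PRETRAIN_TOKENS_SEEN_MIN = 50_000_000_000  # "a little over 50B" tokens at the checkpoint
PASSES = 8                                 # "about 8" passes over the pretraining data

# Upper bound on unique pretraining tokens.
pretrain_unique_max = FILTERED_TOKENS + SYNTHETIC_TEXTBOOK_MAX

# Unique tokens implied by the tokens seen and the number of passes.
implied_unique = PRETRAIN_TOKENS_SEEN_MIN / PASSES

# Synthetic textbooks as a share of the pretraining mix (upper bound).
synthetic_share_max = SYNTHETIC_TEXTBOOK_MAX / pretrain_unique_max

# Finetuning data relative to the pretraining corpus.
exercise_ratio = EXERCISE_TOKENS / pretrain_unique_max

print(f"unique pretraining tokens (upper bound): {pretrain_unique_max / 1e9:.1f}B")
print(f"unique tokens implied by 50B / 8 passes: {implied_unique / 1e9:.2f}B")
print(f"synthetic share of pretraining mix:      <= {synthetic_share_max:.1%}")
print(f"CodeExercises vs pretraining corpus:     {exercise_ratio:.1%}")
```

The implied unique-token count sits below the 7B upper bound, so the "about 8 passes" and "a little over 50B tokens" figures are mutually consistent.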

The filtering step is a compute substitute for raw scale. The authors annotate about 100K samples with GPT-4 for educational value, train a classifier using embeddings from a pretrained codegen model, and select files that look like textbook-quality teaching examples. Synthetic textbooks target reasoning and basic algorithmic skills, while CodeExercises aligns the model to docstring-to-function completion.
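
The filtering idea can be illustrated with a toy example: label a small sample, fit a cheap classifier on fixed embeddings, then apply it to the whole corpus. The sketch below is not the authors' pipeline. The two-dimensional embeddings, file names, and labels are invented for illustration, and a nearest-centroid rule stands in for whatever classifier the paper actually trains.

```python
from statistics import fmean


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroidFilter:
    """Toy quality filter: keep a file if its embedding is closer to the
    'educational' centroid than to the 'not educational' centroid."""

    def fit(self, embeddings, labels):
        positive = [e for e, y in zip(embeddings, labels) if y == 1]
        negative = [e for e, y in zip(embeddings, labels) if y == 0]
        self.positive_centroid = centroid(positive)
        self.negative_centroid = centroid(negative)
        return self

    def keep(self, embedding):
        return (squared_distance(embedding, self.positive_centroid)
                < squared_distance(embedding, self.negative_centroid))


# HYPOTHETICAL data: in the paper the labels come from GPT-4 annotations and the
# embeddings from a pretrained code model; here both are made up.
annotated_embeddings = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
annotated_labels = [1, 1, 0, 0]  # 1 = high educational value

quality_filter = NearestCentroidFilter().fit(annotated_embeddings, annotated_labels)

# HYPOTHETICAL unlabeled corpus, keyed by file name.
corpus = {
    "tutorial_sort.py": [0.85, 0.75],
    "autogen_config.py": [0.15, 0.25],
}
selected = [name for name, emb in corpus.items() if quality_filter.keep(emb)]
print(selected)
```

The point of the design is cost: the expensive annotator labels only a small sample, and the cheap classifier extends those judgments to the rest of the corpus.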

Architecturally, phi-1 avoids exotic efficiency techniques beyond FlashAttention. The authors explicitly say they do not use Fill-In-the-Middle or Multi-Query Attention, both of which could further improve code-model efficiency. That makes the compute result easier to read: most of the advantage comes from data construction plus a standard decoder-only Transformer that fits comfortably on 8 A100s with DeepSpeed.

Evidence

The headline evidence is the scale-performance table. phi-1 has 1.3B parameters, a 7B-token dataset, and reaches 50.6% pass@1 on HumanEval and 55.5% on MBPP. In the same table, CodeGen-Mono-16.1B is trained on 577B tokens and reports 29.3% HumanEval and 35.3% MBPP; StarCoder has 15.5B parameters, 1T tokens, and reports 33.6% HumanEval and 52.7% MBPP. The comparison is not perfectly controlled, but it supports the compute-efficiency thesis.
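
To make the scale gap concrete, the sketch below computes how much larger the comparison models are than phi-1 using only the numbers quoted in the paragraph above. It compares dataset sizes as reported. phi-1 makes several passes over its data, so the gap in tokens processed is smaller than the gap in dataset size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    name: str
    params_b: float   # parameters, in billions
    tokens_b: float   # dataset size, in billions of tokens
    humaneval: float  # pass@1, percent
    mbpp: float       # pass@1, percent


# Values copied from the paragraph above.
PHI_1 = Row("phi-1", 1.3, 7, 50.6, 55.5)
BASELINES = [
    Row("CodeGen-Mono-16.1B", 16.1, 577, 29.3, 35.3),
    Row("StarCoder", 15.5, 1000, 33.6, 52.7),
]

ratios = {}
for row in BASELINES:
    ratios[row.name] = {
        "params": row.params_b / PHI_1.params_b,
        "tokens": row.tokens_b / PHI_1.tokens_b,
        "humaneval_gap": PHI_1.humaneval - row.humaneval,
        "mbpp_gap": PHI_1.mbpp - row.mbpp,
    }

for name, r in ratios.items():
    print(f"{name}: {r['params']:.1f}x params, {r['tokens']:.1f}x dataset tokens, "
          f"phi-1 ahead by {r['humaneval_gap']:.1f} pts HumanEval, "
          f"{r['mbpp_gap']:.1f} pts MBPP")
```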

Figure 2.1 isolates data quality. A 1.3B phi-1-base checkpoint trained on CodeTextbook reaches 29% HumanEval after 51B tokens and 770 GPU-hours. The comparison model in the same figure, trained on The Stack+, used 76B tokens and 1090 GPU-hours. The paper also reports that for 350M models, training on unfiltered Stack plus StackOverflow saturates at 12.19% HumanEval even after roughly 200B tokens, while training on the filtered subset reaches 17.68% after 36K steps and improves to 20.12% when synthetic textbooks are added.
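
One way to read the Figure 2.1 accounting is to compare tokens processed per GPU-hour across the two 1.3B runs. The sketch below uses only the token and GPU-hour figures quoted above and treats them as exact, which they are not.

```python
# Figures quoted from Figure 2.1 in the notes above (rounded in the source).
RUNS = {
    "CodeTextbook (phi-1-base)": {"tokens": 51e9, "gpu_hours": 770},
    "The Stack+": {"tokens": 76e9, "gpu_hours": 1090},
}

throughput = {
    name: run["tokens"] / run["gpu_hours"]  # tokens per GPU-hour
    for name, run in RUNS.items()
}

gpu_hour_ratio = (RUNS["The Stack+"]["gpu_hours"]
                  / RUNS["CodeTextbook (phi-1-base)"]["gpu_hours"])
throughput_ratio = (throughput["The Stack+"]
                    / throughput["CodeTextbook (phi-1-base)"])

for name, tps in throughput.items():
    print(f"{name}: {tps / 1e6:.1f}M tokens per GPU-hour")
print(f"The Stack+ run used {gpu_hour_ratio:.2f}x the GPU-hours")
print(f"throughput ratio (The Stack+ / CodeTextbook): {throughput_ratio:.2f}")
```

The two runs process tokens at a similar rate per GPU-hour, which suggests the extra GPU-hours of the The Stack+ run reflect more tokens rather than a slower setup. That is what makes the figure a comparison of data rather than of hardware efficiency.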

Finetuning is small but high-impact. The CodeExercises dataset is under 200M tokens and takes about 7 hours on the same 8 A100s, yet the final phi-1 jumps to roughly 51% HumanEval. On the authors' 50-problem unconventional coding evaluation, phi-1 scores 52%, while phi-1-base scores 37%.
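
The size of the finetuning step relative to pretraining can be put in numbers. The sketch below combines the 7-hour finetuning figure with the 770 GPU-hour pretraining figure from Figure 2.1. It takes 29% as the phi-1-base HumanEval score and 50.6% as the phi-1 score, as quoted elsewhere in these notes.

```python
N_GPUS = 8
FINETUNE_WALL_HOURS = 7
PRETRAIN_GPU_HOURS = 770       # phi-1-base, from Figure 2.1

HUMANEVAL_BASE = 29.0          # phi-1-base, percent pass@1
HUMANEVAL_FINETUNED = 50.6     # phi-1, percent pass@1

finetune_gpu_hours = N_GPUS * FINETUNE_WALL_HOURS
total_gpu_hours = PRETRAIN_GPU_HOURS + finetune_gpu_hours
finetune_share = finetune_gpu_hours / total_gpu_hours

humaneval_gain = round(HUMANEVAL_FINETUNED - HUMANEVAL_BASE, 1)

print(f"finetuning cost:  {finetune_gpu_hours} GPU-hours")
print(f"share of total:   {finetune_share:.1%}")
print(f"HumanEval gain:   {humaneval_gain} points")
```

On these figures, finetuning accounts for well under a tenth of the total GPU-hours while contributing a large part of the final HumanEval score.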

Historical Effect

phi is a data-efficiency counterpoint to the hyperscale dense LLM trend. It did not show that small models replace frontier models generally. It showed that, in a narrow domain with a carefully shaped curriculum, a small A100 training run could reach benchmark scores associated with much larger code models. That made synthetic textbook data, data filtering, and curriculum design part of the compute story rather than just dataset hygiene.

Historically, phi helped normalize the idea that "more useful tokens" can beat "more tokens" under a fixed accelerator budget. It also foreshadowed later small-model work in which training data generation, pruning, and distillation are treated as first-class levers for fitting capability into constrained training and inference budgets.

Limits

The result is domain-specific and benchmark-sensitive. The model is trained for Python code generation and simple docstring-style tasks, not broad assistant behavior. The paper withholds some synthetic data generation details for proprietary reasons, which limits reproducibility. The authors also note that GPT-3.5-generated synthetic data can contain errors, and Appendix B describes weaknesses that follow from the small 1.3B-parameter, 7B-token scale: less robustness to ambiguous natural language, less generality than ChatGPT or StarCoder, and limited ability on complex application-building tasks.

The compute claim should therefore be read narrowly: phi demonstrates strong code benchmark returns from curation and synthetic exercises on 8 A100s, not a general replacement for larger multilingual or general-purpose pretraining runs.