Training language models to follow instructions with human feedback

Metadata

Compute Setup

The paper does not list the hardware device type, GPU/TPU model, node count, or memory per accelerator. Under the project rule, the device setup is therefore recorded as an inference: OpenAI GPT-3-class accelerator infrastructure. The paper does provide strong compute proxies. All models use the GPT-3 architecture, with policies at 1.3B, 6B, and 175B parameters. Training uses fp16 weights and activations with fp32 master copies of the weights, a 2k-token context length, and Adam with beta1 = 0.9 and beta2 = 0.95. Prompts longer than 1k tokens are filtered out, and responses are capped at 1k tokens.
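The fp32 master copy matters because small updates can vanish when they are added directly to an fp16 weight. The following is a minimal, generic illustration of that effect using only the Python standard library. It is not the authors' training code, and the weight value, update size, and step count are arbitrary assumptions chosen to make the rounding visible.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Hypothetical values for illustration only.
update = 1e-4   # a small per-step weight update
steps = 100

# Case 1: accumulate directly in fp16. Near 1.0 the fp16 spacing is 2**-10
# (about 0.00098), so each 1e-4 update is rounded away.
w16 = to_fp16(1.0)
for _ in range(steps):
    w16 = to_fp16(w16 + update)

# Case 2: accumulate in an fp32 master copy, then cast to fp16 for use.
master = to_fp32(1.0)
for _ in range(steps):
    master = to_fp32(master + update)
w16_from_master = to_fp16(master)

print(w16)               # 1.0 (all updates lost)
print(w16_from_master)   # 1.009765625 (nearest fp16 value to about 1.01)
```

The fp16 copy is still used for the forward and backward passes in this scheme; only the accumulation of updates happens at higher precision.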

Supervised fine-tuning runs for 16 epochs with residual dropout 0.2 and cosine learning-rate decay, at batch size 32 for the 1.3B and 6B models and batch size 8 for the 175B model. The final reward model is a single 6B model trained for one epoch with a batch size of 64 prompts. Each prompt has K=4 to K=9 ranked completions, and each ranking expands into up to K(K-1)/2 pairwise comparisons, so a batch holds at most 64 x 36 = 2,304 comparisons. RLHF trains all policies for 256k episodes over about 31k unique prompts, with batch size 512, minibatch size 64 (8 minibatches per batch), and one inner epoch per batch. The authors report aggregate compute of 4.9 petaflops/s-days for the 175B SFT model and 60 petaflops/s-days for the 175B PPO-ptx model, versus 3,640 petaflops/s-days for GPT-3 pretraining.
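The batch arithmetic in this paragraph can be checked in a few lines. This is a small sketch using the figures quoted above; the function name is hypothetical, and the bound assumes every prompt in the batch has the same number of ranked completions.

```python
from math import comb


def max_comparisons_per_batch(prompts_per_batch: int, k: int) -> int:
    """Upper bound on pairwise comparisons when each prompt has k ranked completions."""
    return prompts_per_batch * comb(k, 2)


# Reward model: 64 prompts per batch, K between 4 and 9.
low = max_comparisons_per_batch(64, 4)    # 64 * 6
high = max_comparisons_per_batch(64, 9)   # 64 * 36
print(low, high)  # 384 2304

# PPO: a batch of 512 episodes split into minibatches of 64.
ppo_minibatches = 512 // 64
print(ppo_minibatches)  # 8
```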

Bottleneck

The bottleneck is aligning a pretrained model to user intent without retraining the base model from scratch. GPT-3-scale pretraining produces broad capabilities but not reliably helpful, honest, instruction-following behavior. Prompting helps, but it does not change the model's underlying behavior enough for the API prompt distribution. The expensive resource becomes high-quality human demonstrations and comparisons, plus stable post-training on models up to 175B parameters.

There is also a compute and stability bottleneck in reward modeling and PPO. A 175B reward model could reach lower validation loss, but the authors found its training less stable, which made it less suitable as a value-function initialization, and it would have been much more expensive to run inside PPO. PPO can also cause regressions on public NLP datasets, so the method mixes pretraining gradients into RL training.

Method Adaptation

InstructGPT adapts RLHF into a reusable assistant-training pipeline. Step 1 collects labeler demonstrations and trains a supervised policy. The SFT dataset contains about 13k training prompts from API and labeler-written sources. Step 2 collects comparison data: labelers rank K=4 to K=9 model outputs for a prompt, producing pairwise comparisons for reward-model training. The RM dataset contains about 33k training prompts. Step 3 optimizes the SFT policy with PPO on the reward model, using about 31k PPO prompts from the API distribution.
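Step 2 can be made concrete with a sketch of how one ranking becomes a pairwise training signal. The code below assumes the standard pairwise logistic ranking loss, averaged over all pairs from one prompt, and a strict ranking with no ties. The function names and example scores are hypothetical, and scalar rewards stand in for the outputs of a real reward model.

```python
import math
from itertools import combinations


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ranking_loss(rewards_best_first: list[float]) -> float:
    """Mean of -log(sigmoid(r_w - r_l)) over all (preferred, rejected) pairs.

    rewards_best_first holds the reward model's scalar scores for one prompt's
    completions, ordered from the labeler's most preferred to least preferred.
    """
    if len(rewards_best_first) < 2:
        raise ValueError("need at least two completions to form a comparison")
    # combinations() preserves input order, so the first element of each pair
    # is always the completion the labeler ranked higher.
    pairs = list(combinations(rewards_best_first, 2))
    # -log(sigmoid(d)) equals softplus(-d).
    total = sum(softplus(-(r_w - r_l)) for r_w, r_l in pairs)
    return total / len(pairs)


# Hypothetical scores for K=4 completions of one prompt.
agrees_with_ranking = pairwise_ranking_loss([3.0, 2.0, 1.0, 0.0])
reverses_ranking = pairwise_ranking_loss([0.0, 1.0, 2.0, 3.0])
print(agrees_with_ranking < reverses_ranking)  # True
```

Treating all pairs from one prompt as a single batch element, rather than shuffling them across the dataset, means each completion needs only one forward pass through the reward model per prompt.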

The 6B reward model is a key compute adaptation. Rather than train and serve a 175B reward model and a 175B value function inside every PPO update, the authors use one 6B RM and one 6B value function for all policy sizes. This reduces memory and compute and makes comparisons across policy sizes cleaner. The reward model outputs a single scalar reward for the full prompt/response pair, and PPO adds a per-token KL penalty against the SFT model, which discourages the policy from drifting far from the SFT distribution and mitigates reward over-optimization.
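One way to picture how a sequence-level reward and a per-token KL penalty combine is the sketch below. It assumes a common implementation convention in which the penalty is applied at every response token and the reward model's scalar is credited at the final token; the paper's own implementation details may differ. The function name, the log-probabilities, and the value of beta are all hypothetical.

```python
def shaped_rewards(
    rm_score: float,
    logp_policy: list[float],
    logp_sft: list[float],
    beta: float,
) -> list[float]:
    """Per-token rewards for one sampled response.

    Each token receives -beta * (log pi_RL(token) - log pi_SFT(token)), a
    sample-based estimate of the KL penalty. The reward model's scalar score
    is added on the last token (an assumed convention).
    """
    if not logp_policy or len(logp_policy) != len(logp_sft):
        raise ValueError("log-prob lists must be non-empty and equal in length")
    rewards = [-beta * (lp - lq) for lp, lq in zip(logp_policy, logp_sft)]
    rewards[-1] += rm_score
    return rewards


# Hypothetical three-token response.
example = shaped_rewards(
    rm_score=1.0,
    logp_policy=[-1.0, -0.5, -2.0],
    logp_sft=[-1.0, -1.5, -2.0],
    beta=0.1,
)
# The policy is more confident than the SFT model on the second token only,
# so that token is the one that gets penalized.
print(example)
```

Summing the per-token values recovers the sequence-level quantity: the reward model score minus beta times the summed log-ratio between the policy and the SFT model.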

PPO-ptx adds another adaptation: pretraining gradients are mixed into PPO gradients. For each minibatch, PPO and pretraining gradients are computed in consecutive steps and accumulated, with the pretraining gradients multiplied by a coefficient. This adds training work but reduces regressions on tasks like SQuADv2 and DROP.
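The accumulation step described above can be sketched as follows. This is a schematic with gradients represented as flat lists of floats. The function name and the numbers are hypothetical, and a real implementation would accumulate into the optimizer's gradient buffers rather than build new lists.

```python
def mix_gradients(
    ppo_grads: list[float],
    pretrain_grads: list[float],
    ptx_coef: float,
) -> list[float]:
    """Accumulate PPO gradients with pretraining gradients scaled by ptx_coef."""
    if len(ppo_grads) != len(pretrain_grads):
        raise ValueError("gradient vectors must have the same length")
    return [g_ppo + ptx_coef * g_pt for g_ppo, g_pt in zip(ppo_grads, pretrain_grads)]


# Hypothetical gradients for a two-parameter model.
# Consecutive steps: the PPO gradient is computed first, then the pretraining
# gradient, and a single optimizer update is taken on the mixed result.
ppo_g = [1.0, 2.0]
pretrain_g = [0.5, -1.0]
mixed = mix_gradients(ppo_g, pretrain_g, ptx_coef=2.0)
print(mixed)  # [2.0, 0.0]
```

Setting the coefficient to zero recovers plain PPO, which is the configuration that shows the regressions on public NLP tasks.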

Evidence

The headline evidence is a compute-efficiency result in human preference space. Outputs from the 1.3B InstructGPT PPO-ptx model are preferred to outputs from 175B GPT-3, despite having over 100x fewer parameters (175B / 1.3B is roughly 135x). The 175B InstructGPT model is preferred to 175B GPT-3 85 +/- 3% of the time and to few-shot prompted 175B GPT-3 71 +/- 4% of the time. This is the central historical point: a comparatively modest post-training budget can move user preference more than a very large increase in base-model parameter count.

The data pipeline scale is also explicit. The authors hire a team of about 40 contractors. The SFT dataset has 13k training prompts, the RM dataset 33k training prompts, and the PPO dataset 31k training prompts. The final evaluations use both training labelers and held-out labelers, and held-out labelers prefer InstructGPT at about the same rate as the labelers who generated training data. Reward-model cross-validation across labeler groups gives 72.4 +/- 0.4% accuracy on preferences from labelers in the training group and 69.6 +/- 0.9% on held-out labeler groups.

The paper's compute accounting supports the post-training framing. A 175B PPO-ptx run costs 60 petaflops/s-days, about 1.6% of the 3,640 petaflops/s-days spent on GPT-3 pretraining, and the 175B SFT run costs 4.9 petaflops/s-days, about 0.13%. The authors state that this is a fraction of the pretraining spend and that RLHF is more effective for their customer task distribution than a 100x model-size increase.
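The size of that fraction follows directly from the three reported figures. The short calculation below uses only the numbers quoted in this note; the variable and function names are hypothetical.

```python
# Reported aggregate compute, in petaflops/s-days.
GPT3_PRETRAIN = 3640.0
SFT_175B = 4.9
PPO_PTX_175B = 60.0


def fraction_of_pretraining(cost: float) -> float:
    """Post-training cost as a fraction of GPT-3 pretraining compute."""
    return cost / GPT3_PRETRAIN


sft_share = fraction_of_pretraining(SFT_175B)
ppo_share = fraction_of_pretraining(PPO_PTX_175B)
print(f"SFT: {sft_share:.2%}, PPO-ptx: {ppo_share:.2%}")
# SFT: 0.13%, PPO-ptx: 1.65%
```

These are per-run figures for the final 175B models and do not include reward-model training, experimental runs, or data collection.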

Historical Effect

InstructGPT turned alignment into an industrial post-training pipeline: demonstrations, preference rankings, reward model, PPO, KL control, and pretraining mix. It made the assistant behavior of later chat models legible as a compute layer after pretraining rather than as a property of the base model alone. It also made human preference the main benchmark for instruction following, not just zero-shot NLP scores.

The compute lesson was decisive. Instead of training a larger base model to get better API behavior, the paper showed that a smaller aligned model could beat a much larger unaligned model in direct preference comparisons. That shifted investment toward data quality, labeler instructions, reward-model stability, and RLHF infrastructure.

Limits

Hardware remains undisclosed, so the exact device-level recipe cannot be reproduced from the paper. The cost numbers are aggregate petaflops/s-days, not accelerator counts, memory footprints, or wall-clock schedules. The method also inherits the preferences and blind spots of a small labeler pool. The authors explicitly frame alignment as alignment to their labelers' and researchers' preferences, not a broad notion of human values.

The model still makes simple mistakes, hallucinates, and can regress on public NLP tasks without PPO-ptx. Mixing pretraining gradients mitigates some regressions but increases training complexity and compute. The 6B reward model is a pragmatic compromise, not proof that reward modeling is solved. Finally, most comparisons are labeled by one contractor for cost reasons, so disagreement and plural preferences are only partially represented.

Links

  • Compute regime: history/compute_regimes/inference_time_compute_post_training/README.md
  • Source PDF and extracted text are listed in metadata above.
  • Queue status: read_complete.
  • Method index: rlhf
  • Ledger updates: compute bottlenecks