Learning to summarize from human feedback

Metadata

Reading status: read complete
Year: 2020
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2020-summarize_human_feedback_2020.pdf
Extracted text: 2020-summarize_human_feedback_2020.txt
PDF URL: https://arxiv.org/pdf/2009.01325.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper does not name the GPU model, node count, memory size, or cluster topology, but it does report concrete training settings. The authors use GPT-3-style Transformer decoders with 1.3B and 6.7B parameters for the human-feedback experiments. Appendix B states that all models use fp16 activations and the Adam optimizer. Nearly all supervised baselines, reward models, and reinforcement-learning models use fp32 weights; the exception is the TL;DR supervised baselines, which use fp16 weights.

For pretraining, the models are trained on Common Crawl, WebText, books, and Wikipedia for 200-300B tokens, with input length 2048 and batch size ramping to a model-specific maximum. The hyperparameter table lists maximum batch sizes of 512 for 1.3B, 3B, and 6.7B, and 1024 for 13B. Supervised TL;DR finetuning uses batch size 128 for one epoch. Reward modeling uses batch size 64 for one epoch. PPO uses separate policy and value networks, batch size 512 for the 1.3B model and 256 for the 6.7B model, four epochs of optimization per rollout batch, and 1M episodes. The paper explicitly states that RL fine-tuning the 6.7B model required approximately 320 GPU-days. Under the project rule, the undisclosed hardware is inferred as OpenAI GPU cluster infrastructure.
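As a rough illustration of what the PPO settings above imply, the sketch below converts the episode budget and batch sizes into counts of rollout batches and optimization passes. The function name is hypothetical, and it assumes that one batch corresponds to one rollout batch of episodes, which the card does not state explicitly.

```python
def ppo_schedule(total_episodes: int, batch_size: int, epochs_per_batch: int) -> dict:
    """Back-of-envelope PPO schedule.

    Assumption (not from the paper): each batch is one rollout batch of
    `batch_size` episodes, and a partial final batch is dropped.
    """
    rollout_batches = total_episodes // batch_size
    return {
        "rollout_batches": rollout_batches,
        "optimization_passes": rollout_batches * epochs_per_batch,
    }


# Settings reported above: 1M episodes, 4 epochs per rollout batch.
schedule_1_3b = ppo_schedule(1_000_000, batch_size=512, epochs_per_batch=4)
schedule_6_7b = ppo_schedule(1_000_000, batch_size=256, epochs_per_batch=4)
print("1.3B:", schedule_1_3b)
print("6.7B:", schedule_6_7b)
```

Under these assumptions the 6.7B run takes about twice as many rollout batches as the 1.3B run (3906 versus 1953), because its batch size is half as large.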

Bottleneck

The bottleneck is the quality of the training objective, weighed against the cost of human preference data and RL. ROUGE and reference-summary likelihood are cheap to optimize, but the paper argues that they are rough proxies for human summary quality. Optimizing them can reward copying, length artifacts, or lexical overlap rather than coverage, accuracy, coherence, and overall usefulness. Human comparisons are a better target, but they are expensive to collect and require an additional reward-model training loop.

The compute bottleneck appears after labels are collected. PPO for sequence generation is much heavier than supervised finetuning because each episode requires sampling, reward-model scoring, KL accounting, value estimation, and multiple optimization passes. A separate value network roughly doubles the trainable large-model state during RL, but the paper adopts it because sharing value and policy parameters can damage the pretrained policy early in training.
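To make the memory argument concrete, here is a minimal sketch of parameter-state accounting for a PPO run. Every byte count is an assumption, not a figure from the paper: fp32 weights (4 bytes per parameter), plus fp32 gradients and two Adam moments (12 more bytes) for trainable networks. The sketch also assumes two frozen networks, the reward model and the supervised reference policy used for the KL term, and it ignores activations, sharding, and the small value head.

```python
GIB = 1024 ** 3


def state_bytes(n_params: float, trainable: bool) -> float:
    # Assumed layout: fp32 weights (4 B/param); trainable networks add
    # fp32 gradients (4 B) and two fp32 Adam moments (8 B).
    bytes_per_param = 16 if trainable else 4
    return n_params * bytes_per_param


def ppo_state_gib(n_params: float, separate_value: bool) -> float:
    """Rough parameter-state footprint in GiB; activations are ignored."""
    trainable_nets = 2 if separate_value else 1  # policy, plus value network if separate
    frozen_nets = 2  # assumed: reward model + supervised reference policy
    total = (trainable_nets * state_bytes(n_params, trainable=True)
             + frozen_nets * state_bytes(n_params, trainable=False))
    return total / GIB


shared = ppo_state_gib(6.7e9, separate_value=False)
separate = ppo_state_gib(6.7e9, separate_value=True)
print(f"shared value head:      {shared:.0f} GiB")
print(f"separate value network: {separate:.0f} GiB")
```

The absolute numbers depend entirely on the assumed byte counts. The point of the sketch is the difference: a separate value network adds one more full set of weights, gradients, and optimizer moments.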

Method Adaptation

The method adapts RLHF to a batch summarization setting. The authors first fine-tune GPT-style models on filtered TL;DR summaries, producing supervised baselines. They then sample summaries from existing policies, references, and baselines, and ask human evaluators to choose the better summary in pairwise comparisons. A reward model is initialized from the supervised baseline with an added scalar head. It is trained so that the difference between the scalar rewards of two summaries predicts the log odds that a human prefers one over the other. Reward outputs are normalized so that reference summaries have a mean score of zero.
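The pairwise objective described above can be sketched in a few lines. This is a minimal illustration that uses plain floats in place of a neural reward model. The function names are hypothetical, and implementing the normalization as a constant offset is an assumption about how the zero-mean property could be achieved.

```python
import math
from typing import List


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human choice: -log sigmoid(r_pref - r_rej)."""
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)); branch to avoid overflow for large |d|.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def preference_probability(reward_a: float, reward_b: float) -> float:
    """The reward difference is read as the log odds that A is preferred to B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def normalize_to_references(rewards: List[float], reference_rewards: List[float]) -> List[float]:
    """Shift rewards so reference summaries score zero on average (assumed: constant offset)."""
    offset = sum(reference_rewards) / len(reference_rewards)
    return [r - offset for r in rewards]


print(pairwise_loss(1.2, 0.3))           # low loss: model agrees with the label
print(pairwise_loss(0.3, 1.2))           # higher loss: model disagrees
print(preference_probability(1.2, 0.3))  # implied P(A preferred over B)
```

Because only reward differences enter the loss, shifting all rewards by a constant leaves the training objective unchanged, which is why a zero-mean normalization on references can be applied afterward.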

The policy is then optimized with PPO. The reward is the reward-model score minus a penalty on the KL divergence from the supervised model. The KL term acts both as an entropy bonus that encourages exploration and as a guardrail against drifting outside the reward model's training distribution. Each generated token is a time step, but the reward arrives only once the full summary is complete.
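A minimal sketch of the KL-penalized reward follows, computed from per-token log-probabilities of one sampled summary. The function names and the coefficient value are illustrative. The per-token variant, which spreads the KL penalty across tokens and adds the reward-model score at the last token, is an assumption about a common implementation choice rather than something the card confirms.

```python
from typing import List


def kl_penalized_reward(rm_score: float, policy_logprobs: List[float],
                        sft_logprobs: List[float], beta: float) -> float:
    """Sequence-level reward: rm_score - beta * log(pi_RL(y|x) / pi_SFT(y|x))."""
    if not policy_logprobs or len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-prob lists must be non-empty and the same length")
    # The sequence log-ratio is the sum of per-token log-ratios.
    log_ratio = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * log_ratio


def per_token_rewards(rm_score: float, policy_logprobs: List[float],
                      sft_logprobs: List[float], beta: float) -> List[float]:
    """Assumed per-time-step split: KL penalty at every token, RM score at the last."""
    if not policy_logprobs or len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-prob lists must be non-empty and the same length")
    rewards = [-beta * (p - s) for p, s in zip(policy_logprobs, sft_logprobs)]
    rewards[-1] += rm_score
    return rewards


# Toy numbers for a three-token summary; beta is an illustrative coefficient.
policy_lp = [-1.0, -0.5, -2.0]
sft_lp = [-1.2, -0.7, -1.5]
total = kl_penalized_reward(1.5, policy_lp, sft_lp, beta=0.05)
print(total, sum(per_token_rewards(1.5, policy_lp, sft_lp, beta=0.05)))
```

The two forms agree when the per-token rewards are summed without discounting, so the split changes how credit is assigned across time steps but not the total objective.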

The separate value network is a memory-heavy adaptation. It prevents value updates from partially destroying the pretrained policy, and it is initialized from the reward model. In the reported experiments, reward model, policy, and value function are the same size, so a 6.7B PPO run carries multiple 6.7B-class networks through the training pipeline. The paper also explores best-of-N rejection sampling as a no-training alternative: sample N summaries from the supervised baseline, score them with a reward model, and choose the highest-scoring sample. This shifts cost from training to inference.
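Best-of-N rejection sampling, as described above, can be sketched as follows. The sampler and scorer here are toy stand-ins (hypothetical names, not the paper's models); in the paper's setting they would be the supervised baseline and a learned reward model.

```python
import random
from typing import Callable


def best_of_n(post: str,
              sample_summary: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int) -> str:
    """Draw n candidate summaries and return the one with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_summary(post) for _ in range(n)]
    return max(candidates, key=lambda summary: score(post, summary))


def toy_sampler(post: str) -> str:
    # Hypothetical stand-in for a policy: a random-length prefix of the post.
    words = post.split()
    return " ".join(words[: random.randint(1, len(words))])


def toy_score(post: str, summary: str) -> float:
    # Hypothetical stand-in for a reward model: prefers five-word summaries.
    return -abs(len(summary.split()) - 5)


random.seed(0)
post = "my landlord kept the deposit after I moved out and ignores my emails"
print(best_of_n(post, toy_sampler, toy_score, n=16))
```

Each call costs n forward generations plus n scoring passes, which is the sense in which the method moves cost from training to inference.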

Evidence

The human-data scale is explicit: the released dataset contains 64,832 TL;DR summary comparisons, plus evaluation data. The filtered supervised TL;DR dataset contains 123,169 posts with summaries. The reward-model scaling study trains models from 160M to 13B parameters on 8k to 64k comparisons and finds that doubling data improves validation accuracy by about 1.1 points, while doubling model size improves it by about 1.8 points.
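The per-doubling figures above can be turned into a back-of-envelope comparison. The sketch assumes that each gain is constant per doubling across the whole reported range and that the two effects are independent, which is a simplification the paper does not claim. The function name is hypothetical.

```python
import math


def doublings(start: float, end: float) -> float:
    """Number of doublings needed to go from start to end."""
    return math.log2(end / start)


POINTS_PER_DATA_DOUBLING = 1.1   # reported: about 1.1 points per doubling of data
POINTS_PER_MODEL_DOUBLING = 1.8  # reported: about 1.8 points per doubling of model size

# Assumed log-linear trend over the full reported ranges.
data_gain = doublings(8_000, 64_000) * POINTS_PER_DATA_DOUBLING
model_gain = doublings(160e6, 13e9) * POINTS_PER_MODEL_DOUBLING

print(f"8k -> 64k comparisons: about {data_gain:.1f} points")
print(f"160M -> 13B parameters: about {model_gain:.1f} points")
```

These totals are extrapolations from the two reported slopes, not measured results, so they are best read as an order-of-magnitude comparison between adding data and adding parameters.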

The main preference result shows why the expensive loop was worth running. On TL;DR, a 1.3B human-feedback model beats a 13B supervised model in raw preference against references, 61% versus 43%. The 6.7B human-feedback model is still preferred to references about 65% of the time after controlling for length. In Likert evaluations, 6.7B PPO summaries receive a perfect 7/7 overall score 45% of the time, compared with 20% for the 6.7B supervised baseline and 23% for reference summaries.

The paper also establishes that reward models are better optimization targets than ROUGE. In best-of-N experiments up to best-of-2048 using samples from the 1.3B supervised baseline, optimizing ROUGE peaks sooner and at lower human preference than optimizing learned reward models. The authors warn that reward-model optimization eventually overfits too, but it degrades later than ROUGE. This directly connects inference-time search budget, metric choice, and human preference quality.
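One way to put the best-of-N search budget on the same axis as a KL-regularized policy is the standard analytic expression for the KL divergence between the best-of-N distribution and the base sampler, log N - (N - 1)/N. The sketch below evaluates it. It assumes continuous scores with no ties, and the function name is hypothetical.

```python
import math


def best_of_n_kl(n: int) -> float:
    """KL(best-of-n || base sampler) in nats, assuming continuous, tie-free scores."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.log(n) - (n - 1) / n


for n in (1, 2, 16, 256, 2048):
    print(f"best-of-{n}: {best_of_n_kl(n):.3f} nats")
```

The divergence grows only logarithmically in N, so even best-of-2048 stays within roughly 6.6 nats of the supervised baseline under these assumptions.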

Historical Effect

This paper made RLHF a practical post-training compute pattern for large language generation. It moved preference-based reward modeling into multi-billion-parameter text generation, with a clear three-stage pipeline: supervised initialization, reward-model training, and KL-regularized PPO. It also showed that a smaller human-feedback model could beat a much larger supervised model.

The paper also established a lesson about metrics that carried into instruction following: a learned reward model can be a more useful optimization target than a cheap automatic metric, but only within a controlled optimization budget. This became one of the main templates for later alignment and assistant training systems.

Limits

The paper is clear that reward models can be over-optimized. PPO policies with too little KL penalty eventually receive high reward-model scores while producing worse summaries by human judgment. Best-of-N search has the same structural risk. The reward model is only trained on a relatively narrow distribution of summaries and comparisons, so optimization pressure can expose errors.

The compute cost is substantial: approximately 320 GPU-days for the 6.7B RL finetuning run, plus the cost of pretraining, supervised finetuning, reward-model sweeps, and human labeling. Hardware is not disclosed, so device-level reproducibility is incomplete. The task is also bounded: TL;DR reference summaries are filtered to between 24 and 48 tokens, input posts are filtered to fit the context window, and the labelers' preferences define the target. The method improves human-rated quality in the reported setting, but it does not remove factual errors, labeler bias, or the expense of preference collection.