WebGPT: Browser-assisted question-answering with human feedback

Metadata

Reading status: read complete
Year: 2021
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2021-webgpt_2021.pdf
Extracted text: 2021-webgpt_2021.txt
PDF URL: https://arxiv.org/pdf/2112.09332.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper does not disclose GPU/TPU model, host count, training wall-clock, or serving hardware. It reports experiments on fine-tuned models from the GPT-3 family at 760M, 13B, and 175B parameters. Under the project rule, the device setup is inferred as OpenAI's GPT-3-era accelerator infrastructure, but that is an inference from the provider and model family rather than a paper statement.

The compute scale is nevertheless described through model size, data, and training loop shape. The authors collect about 6,000 browsing demonstrations and about 21,500 comparisons, with final reward models trained on about 16,000 comparisons and 5,500 comparisons held out for evaluation. The main evaluated WebGPT models combine behavior cloning with reward-model rejection sampling: 760M best-of-4, 13B best-of-16, and 175B best-of-64. PPO is also tested: Table 7 lists 256 parallel environments, 256 timesteps per rollout, 1 epoch, and 128 minibatches per epoch. Appendix E states that due to GPU memory constraints they used 16 times as many minibatches per epoch as the default PPO setup, and that PPO iterations lasted multiple hours.

Bottleneck

The bottleneck is long-form factual QA where the model must not only answer but also search, navigate, quote, cite, and synthesize. A static language model can answer from parametric memory, but the task asks for up-to-date and checkable support. Live browsing makes the computation sequential: each search, click, scroll, find, or quote action changes the next textual observation, and the model has no hidden memory of previous steps except what is recorded in the browser state summary and later reference list.

Human feedback is the second bottleneck. Demonstrations teach the command format, but answer quality is optimized by preference comparisons and reward modeling. Human evaluations are noisy and expensive, so the system relies heavily on reward-model scores for early stopping, hyperparameter tuning, PPO rewards, and rejection sampling. That moves some human cost into model inference cost: best-of-n decoding samples many complete browser-assisted answers and lets the reward model choose among them.
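The best-of-n step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `best_of_n`, `sample_answer`, and `reward_model` are hypothetical names, and the toy reward function merely counts citation markers so the example runs without any model.

```python
from typing import Callable, List, Tuple


def best_of_n(
    question: str,
    sample_answer: Callable[[str], str],
    reward_model: Callable[[str, str], float],
    n: int,
) -> Tuple[str, float]:
    """Sample n complete answers and return the highest-scoring one.

    In WebGPT each sample would be a full browsing trajectory plus a
    long-form answer; here `sample_answer` is an opaque stand-in.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample_answer(question) for _ in range(n)]
    scored = [(reward_model(question, c), c) for c in candidates]
    # max() keeps the first candidate when scores tie.
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score


def toy_reward(question: str, answer: str) -> float:
    """Hypothetical reward: number of '[ref]' markers in the answer."""
    return float(answer.count("[ref]"))
```

The cost structure is visible in the signature: one call performs n sampler invocations and n reward-model evaluations, and no gradient updates.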

Method Adaptation

WebGPT adapts the web to the language model rather than adapting the language model to a graphical browser. The source describes a text-based browser: the model receives a written state summary and must issue commands such as search, click, scroll, find, quote, or answer. Search queries go to the Microsoft Bing Web Search API; clicked pages are fetched by a Node.js script and simplified using Mozilla Readability; HTML is converted to text; PDFs are converted with pdfminer.six; and references are formatted into the answer prompt. This design converts a high-bandwidth web interface into a token stream that fits GPT-3 prompting and fine-tuning.
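To make the command interface concrete, the following sketch parses one line of model output into a structured command. The exact command syntax is not recorded in this card, so the `name argument` format, the `Command` class, and `parse_command` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical syntax: "<name> <argument>". Maps each command named in the
# card to whether this sketch requires an argument for it.
ARG_REQUIRED = {
    "search": True,
    "click": True,
    "scroll": True,
    "find": True,
    "quote": True,
    "answer": False,
}


@dataclass(frozen=True)
class Command:
    name: str
    argument: Optional[str] = None


def parse_command(line: str) -> Command:
    """Split one line of model output into a command name and argument."""
    head, _, rest = line.strip().partition(" ")
    name = head.lower().rstrip(":")
    argument = rest.strip() or None
    if name not in ARG_REQUIRED:
        raise ValueError(f"unknown command: {head!r}")
    if ARG_REQUIRED[name] and argument is None:
        raise ValueError(f"command {name!r} needs an argument")
    return Command(name, argument)
```

For example, `parse_command("search how do vaccines work")` yields a `Command` named `search` whose argument is the query string, which an environment could then forward to a search API.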

The training stack has four layers. Behavior cloning trains on human browser commands. Reward modeling scores question-answer-reference triples using comparison labels. PPO fine-tunes the browsing policy against reward-model score plus a KL penalty to the behavior-cloned model. Rejection sampling spends no additional gradient updates but samples 4, 16, or 64 candidate answers and selects the one with the best reward-model score. This is a direct inference-time compute tradeoff: more full answer attempts, more web trajectories, and more reward-model evaluations in exchange for better human preference.
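The reward-modeling and PPO layers can be written compactly. The block below is a sketch in assumed notation, not equations recorded in this card: r_theta is the reward model, a_w and a_l are the preferred and rejected answers in one comparison for question q, pi_RL and pi_BC are the RL and behavior-cloned policies, and beta is a hypothetical KL coefficient. The pairwise logistic form is a common choice for comparison data and is assumed here.

```latex
% Assumed notation; the card does not give the paper's exact loss or KL coefficient.
\begin{align}
P(a_1 \succ a_2 \mid q) &= \sigma\bigl(r_\theta(q, a_1) - r_\theta(q, a_2)\bigr), \\
\mathcal{L}_{\mathrm{RM}}(\theta) &= -\log \sigma\bigl(r_\theta(q, a_w) - r_\theta(q, a_l)\bigr), \\
R(q, a) &= r_\theta(q, a) - \beta \log \frac{\pi_{\mathrm{RL}}(a \mid q)}{\pi_{\mathrm{BC}}(a \mid q)}.
\end{align}
```

The third line shows why the KL term matters: the policy is rewarded for raising the reward-model score only as long as it stays close to the behavior-cloned policy, which limits how far it can drift toward answers that exploit the reward model.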

The PPO details show adaptation to memory and parallelism. The authors use separate policy and value networks for simplicity, chunk long completions into multiple actions to improve rollout parallelizability, insert 15 additional answering-only episodes after each browsing episode to improve sample efficiency by about 2x, and set a maximum of 64 tokens per action. GPU memory pressure forces many small PPO minibatches (128 per epoch, 16 times the default), while 256 parallel environments keep each rollout batch large.
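The batch arithmetic implied by the reported PPO settings can be worked out directly. This sketch assumes that a rollout batch is simply environments times timesteps and that it is split evenly across minibatches; the card does not confirm how the authors' implementation counts a minibatch, and the function name is hypothetical.

```python
# Values reported in the card (Table 7 and Appendix E of the paper).
NUM_ENVS = 256
TIMESTEPS_PER_ROLLOUT = 256
MINIBATCHES_PER_EPOCH = 128
MINIBATCH_MULTIPLIER = 16  # "16 times as many minibatches" as the default


def ppo_batch_shape(num_envs: int, timesteps: int, minibatches: int):
    """Return (timesteps per rollout batch, timesteps per minibatch)."""
    total = num_envs * timesteps
    if total % minibatches != 0:
        raise ValueError("rollout batch does not split evenly")
    return total, total // minibatches


total, per_minibatch = ppo_batch_shape(
    NUM_ENVS, TIMESTEPS_PER_ROLLOUT, MINIBATCHES_PER_EPOCH
)
# Implied default minibatch count, derived from the 16x statement.
default_minibatches = MINIBATCHES_PER_EPOCH // MINIBATCH_MULTIPLIER
_, default_per_minibatch = ppo_batch_shape(
    NUM_ENVS, TIMESTEPS_PER_ROLLOUT, default_minibatches
)

print(total)                  # 65536 timesteps per rollout batch
print(per_minibatch)          # 512 timesteps per minibatch
print(default_per_minibatch)  # 8192 under the implied default of 8 minibatches
```

Under these assumptions each minibatch is 16 times smaller than it would be with the default setting, which is consistent with the memory-constraint explanation in Appendix E.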

Evidence

The top-line evidence is preference-based. The 175B best-of-64 model is preferred to answers written by human demonstrators 56% of the time and to ELI5 highest-voted reference answers 69% of the time. The paper also reports that on TruthfulQA the best model's answers are judged true 75% of the time and both true and informative 54% of the time, outperforming base GPT-3 but falling short of human performance on that benchmark.

The ablation evidence supports the inference-time compute story. Rejection sampling provides a large benefit: the 175B best-of-64 behavior-cloned model is preferred 68% of the time over the 175B behavior-cloned model without rejection sampling. PPO gives a smaller gain: the 175B RL model is preferred 58% of the time over the 175B BC model. The authors argue that rejection sampling can try many answering attempts, visit more websites, and evaluate the found information with hindsight, while PPO is harder to tune and can overoptimize the same reward model. Combining RL with rejection sampling gives little additional benefit over rejection sampling alone.

The data scale also matters. About 6,000 demonstrations teach the browser format, while about 21,500 comparisons support reward modeling and evaluation. RL uses a mixture of 90% ELI5 and 10% TriviaQA questions, with browsing action limits randomized from 20 to 100. Early stopping differs by model size: Table 8 gives 2/5/3 BC epochs for 760M/13B/175B and RL stopping at 19, 30, and 18 PPO iterations respectively.
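The RL episode setup described above can be sketched as a small sampler. The card says only that action limits are "randomized from 20 to 100"; drawing them uniformly over integers is an assumption, as are the function name and the dictionary layout.

```python
import random


def sample_episode_config(rng: random.Random) -> dict:
    """Draw one RL episode configuration.

    Question source: 90% ELI5, 10% TriviaQA (from the card).
    Action limit: uniform integer in [20, 100] (distribution assumed).
    """
    dataset = "ELI5" if rng.random() < 0.9 else "TriviaQA"
    max_actions = rng.randint(20, 100)  # randint is inclusive on both ends
    return {"dataset": dataset, "max_actions": max_actions}


rng = random.Random(0)
configs = [sample_episode_config(rng) for _ in range(10_000)]
eli5_share = sum(c["dataset"] == "ELI5" for c in configs) / len(configs)
```

Over many draws `eli5_share` should sit close to 0.9, and every `max_actions` value falls inside the stated range.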

Historical Effect

WebGPT is an early, concrete demonstration that a frontier language model can operate a textual tool interface, collect references, and then be improved with human feedback. Its compute lesson is that retrieval or browsing is not just a preprocessing stage; it can be part of the policy's action loop. It also showed that best-of-n reward-model reranking can be a strong alternative to additional RL when inference budget is available. Later agent, RAG, and RLHF systems reused this pattern of tool environment, behavior cloning, preference model, KL-constrained optimization, and inference-time candidate selection.

Limits

The paper's own omissions are important for a compute card: hardware is not listed, so device-level cost cannot be reconstructed from the source. The method is much more expensive than static QA because every candidate answer can include a sequence of browser actions plus a final long-form generation, and best-of-64 multiplies that cost. Browser state is text-compressed, so the model can miss information hidden by simplification, search ranking, page conversion, or reference truncation. Reward-model optimization can favor superficially convincing citations, and the authors note that RL can overoptimize the reward model. The system also depends on external search infrastructure and on human labeler judgments that are costly and sometimes noisy.
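A rough upper bound on generated tokens shows how best-of-n multiplies cost. This is a back-of-the-envelope sketch, not a figure from the paper: the 100-action limit and 64 tokens per action are the maxima quoted in this card, `answer_tokens=512` is a purely illustrative assumption, and the count ignores prompt tokens and reward-model evaluations.

```python
def candidate_token_budget(
    max_actions: int, tokens_per_action: int, answer_tokens: int
) -> int:
    """Upper bound on tokens generated for one browsing-plus-answer attempt."""
    return max_actions * tokens_per_action + answer_tokens


def best_of_n_token_budget(
    n: int, max_actions: int, tokens_per_action: int, answer_tokens: int
) -> int:
    """Upper bound on tokens generated across n independent attempts."""
    return n * candidate_token_budget(
        max_actions, tokens_per_action, answer_tokens
    )


# answer_tokens=512 is a hypothetical value chosen only for illustration.
single = candidate_token_budget(100, 64, 512)
best_of_64 = best_of_n_token_budget(64, 100, 64, 512)
print(single)      # 6912
print(best_of_64)  # 442368
```

The point is the linear multiplier: whatever a single attempt costs, best-of-64 costs 64 times as much at inference, before counting the reward-model passes that rank the candidates.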