Deep Reinforcement Learning from Human Preferences

Metadata

Reading status: read complete
Year: 2017
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2017-human_preferences_2017.pdf
Extracted text: 2017-human_preferences_2017.txt
PDF URL: https://arxiv.org/pdf/1706.03741.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper gives one explicit hardware and cost statement. The Atari experiments run on a Google Compute Engine virtual machine with 16 CPUs and one NVIDIA K80 GPU, costing about $700 per month. Training takes about a day, so the compute cost is about $25; training with 5k labels corresponds to about 5 hours of human labor, or about $36 at the U.S. minimum wage used in the paper. That comparison is central: the authors are showing that the cost of human feedback can be comparable to the compute cost rather than orders of magnitude larger.
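The arithmetic behind that comparison can be checked in a few lines. This is only a sketch: the 30-day billing month and the $7.25 hourly wage (the U.S. federal minimum) are assumptions not stated in this card, chosen because they reproduce the rounded figures above.

```python
# Figures quoted from the paper's cost statement.
VM_COST_PER_MONTH_USD = 700.0   # 16 CPUs + one K80 on GCE
TRAINING_DAYS = 1.0             # "about a day"
LABELS = 5_000
HUMAN_HOURS = 5.0               # labor for 5k labels

# Assumptions (not in the card): 30-day month, $7.25/hour minimum wage.
DAYS_PER_MONTH = 30.0
MIN_WAGE_USD_PER_HOUR = 7.25

compute_cost = VM_COST_PER_MONTH_USD / DAYS_PER_MONTH * TRAINING_DAYS
human_cost = HUMAN_HOURS * MIN_WAGE_USD_PER_HOUR
seconds_per_label = HUMAN_HOURS * 3600 / LABELS

print(f"compute: ${compute_cost:.2f}")              # compute: $23.33
print(f"human:   ${human_cost:.2f}")                # human:   $36.25
print(f"pace:    {seconds_per_label:.1f} s/label")  # pace:    3.6 s/label
```

Under these assumptions the human cost is roughly 1.5 times the compute cost, which is the "comparable" regime the paper points to.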

The paper does not list a separate hardware setup for MuJoCo. Under the project rule, the MuJoCo runs are inferred to use 2017-era CPU/GPU workstation or cloud hardware suitable for TRPO simulation and neural reward-model training, but no specific device should be claimed beyond the Atari GCE/K80 statement. Algorithmically, Atari policies use A2C, the synchronous form of A3C, with 16 parallel workers, 5 steps per update, 50M timesteps of actual training, Adam, and the DQN-style convolutional policy architecture from Mnih et al. MuJoCo policies use TRPO. The reward predictor is trained asynchronously from the RL agent.

Bottleneck

The bottleneck is supervision bandwidth, not raw simulator throughput alone. Deep RL can require hundreds or thousands of hours of experience, but asking humans to label every action or every state would be prohibitively expensive. The paper's abstract says the method solves Atari and simulated robot locomotion while providing feedback on less than 1% of the agent's interactions with the environment. The whole compute structure is built around using many cheap environment interactions and a tiny number of expensive human comparisons.
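As a rough sanity check on the "less than 1%" figure, the Atari numbers reported elsewhere in this card (5,500 queries, two 25-timestep clips per query, 50M training timesteps) can be combined. This sketch assumes that clip timesteps and training timesteps are counted in the same units and that no clip is shown twice; neither assumption is confirmed here.

```python
QUERIES = 5_500               # human comparisons in the Atari runs
CLIPS_PER_QUERY = 2           # each comparison shows two clips
TIMESTEPS_PER_CLIP = 25
TOTAL_TIMESTEPS = 50_000_000  # training length for the Atari policies

# Upper bound on the experience a human ever looks at.
labeled_timesteps = QUERIES * CLIPS_PER_QUERY * TIMESTEPS_PER_CLIP
fraction = labeled_timesteps / TOTAL_TIMESTEPS

print(labeled_timesteps)   # 275000
print(f"{fraction:.2%}")   # 0.55%
```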

There is also a nonstationarity bottleneck. The reward predictor is learned from preferences over the agent's current behavior, while the policy is simultaneously changing to exploit the predictor. If labels are only collected at the beginning, the predictor can fail on later states; if labels are too sparse or delayed, the RL policy can exploit errors in the learned reward. The authors therefore need an online, asynchronous pipeline where trajectories, human comparisons, and reward-model parameters continually update each other.

Method Adaptation

The method adapts human feedback to RL by asking for pairwise comparisons between short trajectory segments. A human sees two clips and picks which is better, or can mark a tie or "can't tell." This is much lower bandwidth than full demonstrations or dense reward labels. The learned reward predictor is trained by supervised learning on these comparisons, and the policy then optimizes the predicted reward with a standard RL algorithm.
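The supervised step can be sketched as a Bradley-Terry style preference model, which is the form the paper uses: the probability that one clip is preferred is a softmax over the summed predicted rewards of the two clips, trained with cross-entropy. The function names below are illustrative, ties are encoded as a 0.5 label, "can't tell" pairs are assumed to be dropped before this step, and the sketch omits any label-noise handling.

```python
import math

def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def preference_prob(rewards_a, rewards_b):
    """P(clip A preferred), from per-timestep predicted rewards.

    Equal to exp(sum_a) / (exp(sum_a) + exp(sum_b)).
    """
    return _sigmoid(sum(rewards_a) - sum(rewards_b))

def preference_loss(rewards_a, rewards_b, label_a):
    """Cross-entropy loss; label_a is 1.0 (A), 0.0 (B), or 0.5 (tie)."""
    p = preference_prob(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label_a * math.log(p + eps)
             + (1.0 - label_a) * math.log(1.0 - p + eps))

# Toy clips: A has the higher summed predicted reward.
clip_a = [0.2, 0.5, 0.1]
clip_b = [0.1, 0.1, 0.1]
print(preference_prob(clip_a, clip_b) > 0.5)   # True
print(preference_loss(clip_a, clip_b, 1.0)
      < preference_loss(clip_a, clip_b, 0.0))  # True
```

In a real system the per-timestep rewards would come from the neural reward predictor and the loss would be minimized by gradient descent; only the scalar loss computation is shown here.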

Several design choices keep the clips short and the reward predictor small, which fits the compute and memory limits. For MuJoCo, compared segments last 1.5 seconds, which corresponds to 15 to 60 timesteps depending on the task. For Atari, the reward predictor takes the same 84x84 image input as the policy, stacks 4 frames into an 84x84x4 tensor, and runs four convolutional layers followed by a 64-unit fully connected layer and a scalar output. Atari clips are 25 timesteps, about 1.7 seconds at 15 fps with frame skipping. The predictor uses batch norm, dropout, and regularization to avoid overfitting because the label buffer is small.
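The clip and input sizes above are easy to cross-check. The control rates printed for MuJoCo are inferred from the 1.5-second and 15-to-60-timestep figures, not quoted from the paper.

```python
# Atari: clip length in seconds from timesteps and effective frame rate.
ATARI_CLIP_STEPS = 25
ATARI_FPS = 15  # with frame skipping
atari_clip_seconds = ATARI_CLIP_STEPS / ATARI_FPS

# MuJoCo: implied timesteps per second for the shortest and longest clips.
MUJOCO_CLIP_SECONDS = 1.5
mujoco_rates = [steps / MUJOCO_CLIP_SECONDS for steps in (15, 60)]

# Values per stacked Atari observation fed to the reward predictor.
obs_values = 84 * 84 * 4

print(round(atari_clip_seconds, 2))  # 1.67
print(mujoco_rates)                  # [10.0, 40.0]
print(obs_values)                    # 28224
```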

The feedback scheduler is another compute adaptation. In Atari, the system collects 500 comparisons from a random policy at the beginning, pretrains the reward predictor for 200 epochs, then lowers the labeling rate as training progresses. The predictor keeps only the last 3,000 labels and loops over that buffer continuously, so training emphasizes recent labels as the policy's state distribution shifts. Except where stated otherwise, the system uses an ensemble of three predictors, draws 10 times more candidate clip pairs than it will show the human, and selects the pairs with the highest variance across predictors. That spends model compute to save human attention.
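The query-selection and buffering steps can be sketched as follows. The constants come from the text above; everything else is hypothetical scaffolding. In particular, `make_toy_member` stands in for a trained reward predictor and treats a clip as a single float, and the stored label is a placeholder.

```python
import math
import random
import statistics
from collections import deque

ENSEMBLE_SIZE = 3      # predictors in the ensemble
OVERSAMPLE = 10        # candidate pairs drawn per query actually shown
BUFFER_SIZE = 3_000    # only the most recent labels are kept

def disagreement(pair, ensemble):
    """Variance of the ensemble's predicted P(first clip preferred)."""
    return statistics.pvariance([member(pair) for member in ensemble])

def select_queries(candidates, ensemble, n_queries):
    """Keep the n_queries candidate pairs the ensemble disagrees on most."""
    ranked = sorted(candidates,
                    key=lambda pair: disagreement(pair, ensemble),
                    reverse=True)
    return ranked[:n_queries]

def make_toy_member(weight):
    """Hypothetical stand-in for one trained predictor."""
    def member(pair):
        a, b = pair
        return 1.0 / (1.0 + math.exp(-weight * (a - b)))
    return member

rng = random.Random(0)
ensemble = [make_toy_member(w) for w in (0.5, 1.0, 2.0)]
n_queries = 5
candidates = [(rng.random(), rng.random())
              for _ in range(OVERSAMPLE * n_queries)]
chosen = select_queries(candidates, ensemble, n_queries)

# A bounded deque drops the oldest labels automatically.
label_buffer = deque(maxlen=BUFFER_SIZE)
for pair in chosen:
    label_buffer.append((pair, 1.0))  # 1.0 is a placeholder human label

print(len(candidates), len(chosen), len(label_buffer))  # 50 5 5
```

Scoring 50 candidates with three small predictors is cheap compared with the seconds a human spends on each comparison, which is the trade described above.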

Evidence

On MuJoCo, Figure 2 evaluates eight simulated robotics tasks and compares 700 human queries, several budgets of synthetic queries, and RL using the true reward. The text states that with 700 labels the method nearly matches reinforcement learning on all of these tasks. It also reports that by 1,400 labels the learned reward sometimes performs slightly better than the original true reward, possibly because the learned reward is better shaped: it assigns positive reward to behaviors that are typically followed by high reward. Human feedback ranges from half as efficient as ground-truth feedback to equally efficient, and on Ant it outperforms synthetic feedback because the human instruction to prefer standing upright provides useful shaping.

On Atari, Figure 3 uses 5,500 human queries. The method has more difficulty matching true-reward RL, but shows substantial learning on most games and matches or exceeds RL on some. With synthetic labels, BeamRider and Pong match or approach RL with 3,300 labels; Seaquest and Qbert approach RL more slowly; SpaceInvaders and Breakout improve substantially but do not match RL. With real human feedback, performance is similar to or slightly worse than synthetic feedback with the same number of labels, and often comparable to synthetic feedback with 40% fewer labels. Qbert is a failure case with real human feedback, possibly because its short clips are confusing and hard for annotators to evaluate.

The qualitative behaviors support the claim that little human time is needed. A Hopper performing backflips is trained with 900 queries in less than an hour; a Half-Cheetah moving forward on one leg uses 800 queries in under an hour; and an Enduro behavior that stays alongside other cars uses about 1,300 queries and 4M environment frames. The cost footnote ties this to hardware: one day of Atari training on the 16-CPU/K80 GCE VM costs about $25, while 5k labels cost about 5 human hours.

Historical Effect

Historically, this paper is a compute bridge from classic RL to later RLHF. It shows that preference labels can train a learned reward model that is then optimized by a high-throughput RL loop. The important compute move is amortization: humans label a small, actively selected subset of short clips, and the learned reward model spreads those comparisons across millions of environment interactions. Later language-model RLHF changes the domain and model scale, but keeps the pattern of pairwise preferences, reward modeling, KL/regularized policy optimization, and concern about reward exploitation.

Limits

The method is limited by both human and model failure modes. Human labels can be noisy, inconsistent across contractors, or collected at an uneven rate, which concentrates supervision in narrow regions of state space. Some environments are hard to judge from short clips; Qbert is explicitly called out as confusing for annotators. Learned rewards can be exploited if feedback is not collected online, and the ablations show that online feedback matters; the benefit of disagreement-based query selection is less clear-cut, since the paper describes it as a crude approximation that can hurt on some tasks. The hardware statement is also narrow: only the Atari GCE/K80 setup is explicit, so MuJoCo device-level cost remains unspecified. Finally, the reward predictor and RL policy are small by later standards; the result establishes the preference-learning structure, not a scalable recipe for today's frontier-model hardware.