RLHF and preference optimization
Compute interpretation
A family of post-training methods that spend additional compute on optimization and on preference annotation to shape model behavior.
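As a rough illustration of where the extra optimization compute goes, the sketch below computes the per-pair loss used in Direct Preference Optimization, one of the methods in the reading cards. This page does not specify any objective, so treat it as an assumption-laden example: the function names, the scalar log-probability inputs, and the default `beta=0.1` are hypothetical choices, not taken from the cards.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical default; controls how far the policy may drift
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    trainable policy and under a frozen reference model.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Bradley-Terry style objective: push the chosen margin above the rejected one.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Policy identical to the reference: no preference has been learned yet.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy and the reference assign the same log-probabilities to both responses, the loss is log 2 (about 0.693). It falls below that as the policy favors the chosen response more than the reference does. Each training step needs forward passes through both the policy and the reference model, which is part of the additional compute this page refers to.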
Supporting reading cards
- Deep Reinforcement Learning from Human Preferences (2017, inference_time_compute_post_training)
- Learning to summarize from human feedback (2020, inference_time_compute_post_training)
- WebGPT: Browser-assisted question-answering with human feedback (2021, inference_time_compute_post_training)
- Training language models to follow instructions with human feedback (2022, inference_time_compute_post_training)
- Constitutional AI: Harmlessness from AI Feedback (2022, inference_time_compute_post_training)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023, inference_time_compute_post_training)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025, inference_time_compute_post_training)
- Kimi k1.5: Scaling Reinforcement Learning with LLMs (2025, inference_time_compute_post_training)
- DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (2025, sparse_memory_efficient_scaling)
- Kimi K2.5: Visual Agentic Intelligence (2026, inference_time_compute_post_training)
- Qwen3.5-Omni Technical Report (2026, generative_media_compute)
Obsolete or less central under later compute
Track this only through linked reading cards; do not treat this method page as standalone evidence.