RLHF and preference optimization

Compute interpretation

A family of post-training methods that spends additional optimization and annotation compute to shape model behavior.

Supporting reading cards

Deep Reinforcement Learning from Human Preferences (2017, inference_time_compute_post_training)
Learning to summarize from human feedback (2020, inference_time_compute_post_training)
WebGPT: Browser-assisted question-answering with human feedback (2021, inference_time_compute_post_training)
Training language models to follow instructions with human feedback (2022, inference_time_compute_post_training)
Constitutional AI: Harmlessness from AI Feedback (2022, inference_time_compute_post_training)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023, inference_time_compute_post_training)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025, inference_time_compute_post_training)
Kimi k1.5: Scaling Reinforcement Learning with LLMs (2025, inference_time_compute_post_training)
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (2025, sparse_memory_efficient_scaling)
Kimi K2.5: Visual Agentic Intelligence (2026, inference_time_compute_post_training)
Qwen3.5-Omni Technical Report (2026, generative_media_compute)

Obsolete or less central under later compute

Track this only through linked reading cards; do not treat this method page as standalone evidence.