RLHF and preference optimization

English source file: rlhf.md

Compute interpretation

A family of post-training methods that shape model behavior by spending additional compute on optimization and annotation.

Supporting reading cards

Deep Reinforcement Learning from Human Preferences (2017, inference_time_compute_post_training)
Learning to summarize from human feedback (2020, inference_time_compute_post_training)
WebGPT: Browser-assisted question-answering with human feedback (2021, inference_time_compute_post_training)
Training language models to follow instructions with human feedback (2022, inference_time_compute_post_training)
Constitutional AI: Harmlessness from AI Feedback (2022, inference_time_compute_post_training)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023, inference_time_compute_post_training)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025, inference_time_compute_post_training)
Kimi k1.5: Scaling Reinforcement Learning with LLMs (2025, inference_time_compute_post_training)
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (2025, sparse_memory_efficient_scaling)
Kimi K2.5: Visual Agentic Intelligence (2026, inference_time_compute_post_training)
Qwen3.5-Omni Technical Report (2026, generative_media_compute)

What became obsolete or secondary under later compute paradigms

Trace this only through the linked reading cards; do not treat this method page as an independent source of evidence.