RLHF and preference optimization
English source file: rlhf.md
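Compute interpretation
A family of post-training methods that shape model behavior by spending additional compute on optimization and annotation.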
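As one concrete member of this method family, the Direct Preference Optimization objective named in the reading cards can be sketched for a single preference pair. This is a minimal illustration and not something this page specifies: the function names `log_sigmoid` and `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions chosen for the example, and the inputs are assumed to be summed token log-probabilities of each response under the trainable policy and under a frozen reference model.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response
    under either the policy or the frozen reference model.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Negative log-likelihood that the chosen response wins under a
    # Bradley-Terry preference model on the scaled implicit rewards.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy matches the reference on both responses the loss equals `log 2`; it falls below that as the policy shifts probability toward the chosen response relative to the reference, and rises above it when the shift goes the other way.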
Supporting reading cards
- Deep Reinforcement Learning from Human Preferences (2017, inference_time_compute_post_training)
- Learning to summarize from human feedback (2020, inference_time_compute_post_training)
- WebGPT: Browser-assisted question-answering with human feedback (2021, inference_time_compute_post_training)
- Training language models to follow instructions with human feedback (2022, inference_time_compute_post_training)
- Constitutional AI: Harmlessness from AI Feedback (2022, inference_time_compute_post_training)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023, inference_time_compute_post_training)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025, inference_time_compute_post_training)
- Kimi k1.5: Scaling Reinforcement Learning with LLMs (2025, inference_time_compute_post_training)
- DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (2025, sparse_memory_efficient_scaling)
- Kimi K2.5: Visual Agentic Intelligence (2026, inference_time_compute_post_training)
- Qwen3.5-Omni Technical Report (2026, generative_media_compute)
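Content that is obsolete or secondary under later compute paradigms
Tracked only through the linked reading cards; this method page is not treated as an independent source of evidence.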