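Inference-time compute and post-training

The research frontier shifts to how compute is allocated at inference time: RLHF, chain-of-thought, verifiers, retrieval, tools, and agents.

20 papers. Compute paradigm 8 of 10.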

English source file: README.md

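Hardware / setup

Large pretrained models paired with retrieval systems, human-feedback pipelines, external tools, and an inference-time budget for sampling or search.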

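Bottleneck

Static weights cannot cheaply store all knowledge, and they struggle to deliver reasoning ability, alignment quality, tool use, and factual accuracy all at once at serving time.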

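Methods that fit

RAG/REALM, RLHF and preference optimization, chain-of-thought, self-consistency, ReAct, Toolformer, process supervision, DPO, Tree of Thoughts, and Voyager all add computation after pretraining, either in post-training or at inference time.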
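Of the methods listed above, self-consistency is perhaps the simplest to show in code: sample several reasoning paths for the same question and keep the final answer that appears most often. The sketch below is a minimal illustration, not the paper's implementation. The `sample_answer` callable and `make_toy_sampler` are hypothetical stand-ins for a real model call that returns only the final answer of one sampled chain of thought.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(
    sample_answer: Callable[[str], str],
    question: str,
    num_samples: int = 10,
) -> Tuple[str, float]:
    """Sample several answers and return the most common one with its vote share."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    # Each call stands for one independently sampled reasoning path.
    answers: List[str] = [sample_answer(question) for _ in range(num_samples)]
    # Majority vote over final answers; ties go to the answer seen first.
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / num_samples


def make_toy_sampler(seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: answers "42" with probability 0.6."""
    rng = random.Random(seed)

    def sample(question: str) -> str:
        return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])

    return sample


if __name__ == "__main__":
    answer, share = self_consistency(make_toy_sampler(), "What is 6 * 7?", num_samples=25)
    print(answer, share)
```

The `num_samples` argument is the inference-time budget: raising it costs more generations per question in exchange for a more stable vote.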

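Methods that became obsolete or less central

One-shot next-token prediction with no retrieval, feedback, tools, or deliberation is no longer central in assistants and agents.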

Representative papers

| Rank | Year | Paper | Priority | Status |
|---|---|---|---|---|
| 87 | 2020 | Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | 5 | downloaded / read_complete |
| 88 | 2022 | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | 5 | downloaded / read_complete |
| 89 | 2022 | Training language models to follow instructions with human feedback | 5 | downloaded / read_complete |
| 90 | 2022 | ReAct: Synergizing Reasoning and Acting in Language Models | 5 | downloaded / read_complete |
| 91 | 2017 | Deep Reinforcement Learning from Human Preferences | 4 | downloaded / read_complete |
| 92 | 2020 | Learning to summarize from human feedback | 4 | downloaded / read_complete |
| 93 | 2022 | Self-Consistency Improves Chain of Thought Reasoning in Language Models | 4 | downloaded / read_complete |
| 94 | 2023 | Toolformer: Language Models Can Teach Themselves to Use Tools | 4 | downloaded / read_complete |
| 95 | 2023 | Let's Verify Step by Step | 4 | downloaded / read_complete |
| 96 | 2020 | REALM: Retrieval-Augmented Language Model Pre-Training | 3 | downloaded / read_complete |
| 97 | 2021 | WebGPT: Browser-assisted question-answering with human feedback | 3 | downloaded / read_complete |
| 98 | 2022 | Constitutional AI: Harmlessness from AI Feedback | 3 | downloaded / read_complete |
| 99 | 2022 | Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks | 3 | downloaded / read_complete |
| 100 | 2023 | Direct Preference Optimization: Your Language Model is Secretly a Reward Model | 3 | downloaded / read_complete |
| 101 | 2023 | Tree of Thoughts: Deliberate Problem Solving with Large Language Models | 3 | downloaded / read_complete |
| 102 | 2023 | Voyager: An Open-Ended Embodied Agent with Large Language Models | 3 | downloaded / read_complete |
| 121 | 2025 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | 5 | downloaded / read_complete |
| 122 | 2025 | Kimi k1.5: Scaling Reinforcement Learning with LLMs | 5 | downloaded / read_complete |
| 123 | 2025 | s1: Simple test-time scaling | 4 | downloaded / read_complete |
| 131 | 2026 | Kimi K2.5: Visual Agentic Intelligence | 4 | downloaded / read_complete |

Open questions

  • Quantify when additional inference-time compute is more reliable and cheaper than scaling up pretrained weights.
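One rough way to start on the cost half of this question is a back-of-the-envelope FLOP count. The sketch below assumes the common approximation that a dense transformer forward pass costs about 2 FLOPs per parameter per generated token; it ignores attention cost, prompt processing, KV-cache memory, and batching effects. All function names and the example model sizes are hypothetical, and the sketch says nothing about reliability, which would need measured accuracy at each budget.

```python
def forward_flops(num_params: float, num_tokens: float) -> float:
    """Approximate forward-pass FLOPs, assuming about 2 FLOPs per parameter per token."""
    return 2.0 * num_params * num_tokens


def per_query_flops(num_params: float, tokens_per_sample: float, num_samples: int = 1) -> float:
    """Cost of answering one query with num_samples independent generations."""
    return num_samples * forward_flops(num_params, tokens_per_sample)


def break_even_samples(
    small_params: float,
    large_params: float,
    small_tokens: float,
    large_tokens: float,
) -> float:
    """How many small-model samples cost as much as one large-model generation."""
    return per_query_flops(large_params, large_tokens) / per_query_flops(small_params, small_tokens)


if __name__ == "__main__":
    # Hypothetical sizes: a 7B model versus a 70B model, 500 generated tokens each.
    print(break_even_samples(7e9, 70e9, small_tokens=500, large_tokens=500))
    # Same models, but the small one writes a 2000-token chain of thought per sample.
    print(break_even_samples(7e9, 70e9, small_tokens=2000, large_tokens=500))
```

Under these assumptions the break-even point is simply the parameter ratio scaled by the token ratio, so longer reasoning traces shrink the number of samples the smaller model can afford before it stops being the cheaper option.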

Related papers (20)