DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Metadata

Reading status: read complete
Year: 2025
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2025-deepseek_r1_2025.pdf
Extracted text: 2025-deepseek_r1_2025.txt
PDF URL: https://arxiv.org/pdf/2501.12948.pdf
OpenAlex:
Citation count source/date: Frontier provisional 2026-06-15
Citation count:
Reading card created: 2026-06-15

Compute Setup

DeepSeek-R1 is built on DeepSeek-V3-Base, a MoE model with 671B total parameters and 37B activated per token. The local text reports explicit hardware for the reasoning runs: smaller 30B experiments used A100 GPUs; the 660B-scale DeepSeek-R1-Zero and DeepSeek-R1 runs used 64 x 8 H800 GPUs, i.e. 512 GPUs. R1-Zero took about 198 hours and R1 about 80 hours of wall-clock time. Table 7 reports 101K H800 GPU-hours for R1-Zero, 41K for R1, and 5K for SFT data creation, for 147K H800 GPU-hours in total.
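The wall-clock figures and the Table 7 GPU-hour figures can be cross-checked with simple arithmetic. The sketch below assumes all 512 GPUs were occupied for the full duration of each run, which the card does not state explicitly; the helper name gpu_hours is hypothetical.

```python
# Cross-check of the reported budget, using only figures quoted in this card.
# Assumption: the full 64 x 8 cluster is busy for the whole wall-clock time.
GPUS = 64 * 8  # 64 nodes x 8 H800 GPUs per node


def gpu_hours(wall_clock_hours: float, gpus: int = GPUS) -> float:
    """GPU-hours consumed by a run that holds `gpus` devices for the given time."""
    return wall_clock_hours * gpus


r1_zero = gpu_hours(198)   # reported as about 101K
r1 = gpu_hours(80)         # reported as about 41K
sft_data = 5_000           # reported directly in GPU-hours
total = r1_zero + r1 + sft_data

for name, value in [("R1-Zero", r1_zero), ("R1", r1), ("SFT data", sft_data), ("total", total)]:
    print(f"{name}: {value / 1000:.0f}K H800 GPU-hours")
```

Under that assumption the products round to the same 101K, 41K, and 147K values the card quotes from Table 7.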

The compute substrate is not just the H800 cluster; it is a four-module long-CoT RL pipeline. Rollout workers run the actor model under vLLM, an inference module loads the reference and reward models, a rule-based reward module runs answer matching, format checks, and code execution, and a training module updates the actor. The infrastructure is adapted to GPU memory pressure: model instances are offloaded from VRAM between phases; rollout decoding uses expert parallelism, redundant hot experts, and MTP self-speculative decoding; and training uses length-sorted packing plus DualPipe pipeline parallelism.
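To see why sorting by length matters when outputs vary from hundreds to tens of thousands of tokens, the following minimal sketch compares padding waste for unsorted and length-sorted batches. It is a simplified illustration, not the paper's packing scheme: it assumes each batch is padded to its longest sequence, and the lengths and the function name padding_fraction are made up for the example.

```python
def padding_fraction(lengths, batch_size):
    """Fraction of token slots that are padding when every batch is
    padded to the length of its longest sequence."""
    padded_slots = 0
    real_tokens = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        padded_slots += max(batch) * len(batch)
        real_tokens += sum(batch)
    return 1 - real_tokens / padded_slots


# Hypothetical response lengths in tokens: a mix of short and very long outputs.
lengths = [100, 8000, 300, 7500, 200, 9000, 150, 8500]

unsorted_waste = padding_fraction(lengths, batch_size=4)
sorted_waste = padding_fraction(sorted(lengths), batch_size=4)

print(f"unsorted: {unsorted_waste:.2%} padding")
print(f"sorted:   {sorted_waste:.2%} padding")
```

With these made-up lengths, grouping similar lengths together cuts padding from roughly half of all token slots to under a tenth.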

Bottleneck

The bottleneck is turning a frontier pretrained model into a reliable long-reasoning policy without paying for human-written reasoning traces at scale. The paper argues that conventional SFT can constrain exploration because human demonstrations may omit reflection, verification, and non-human solution paths. RL can explore, but long-CoT RL creates a different systems problem: each prompt can generate 16 very long responses, reward evaluation can be slow, and the policy must stay close enough to the reference model for training to remain stable.

PPO-style RL also becomes awkward at this scale. GRPO is chosen because it avoids a separate value model, reducing memory and compute overhead. Long outputs make padding waste, VRAM residency, reference-model refresh, and rollout latency first-order bottlenecks. Reliable rewards are another constraint: rule-based verifiers work for math, code, STEM, and logic, while model-based rewards are more vulnerable to reward hacking.
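A rule-based verifier of the kind described here can be very small, which is part of why it is hard to hack. The sketch below is a minimal illustration under stated assumptions: the <think>/<answer> tag template, exact string matching of the final answer, and the equal-weight sum of the two rewards are all assumptions made for the example, and the function names are hypothetical.

```python
import re

# Assumed response template: reasoning in <think> tags, final answer in <answer> tags.
TEMPLATE = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed tag template, else 0.0."""
    return 1.0 if TEMPLATE.match(response.strip()) else 0.0


def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the extracted answer matches the reference exactly, else 0.0."""
    match = TEMPLATE.match(response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def rule_reward(response: str, reference: str) -> float:
    """Assumed combination: unweighted sum of accuracy and format rewards."""
    return accuracy_reward(response, reference) + format_reward(response)


good = "<think>2 + 2 is 4</think> <answer>4</answer>"
wrong = "<think>guessing</think><answer>5</answer>"
print(rule_reward(good, "4"), rule_reward(wrong, "4"), rule_reward("4", "4"))
```

A real verifier for math would normalize equivalent answer forms, and a code verifier would execute tests in a sandbox; both are omitted here.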

Method Adaptation

DeepSeek-R1-Zero is the cleanest compute adaptation. Starting from DeepSeek-V3-Base, it skips SFT and uses GRPO with rule-based rewards. For each question, GRPO samples 16 outputs and computes relative advantages within the group. R1-Zero uses learning rate 3e-6, KL coefficient 0.001, rollout temperature 1, and 32 unique questions per step, giving a batch of 32 x 16 = 512 outputs. The maximum output length is 32,768 tokens before step 8.2K and 65,536 afterward; training runs 10,400 steps, or 1.6 epochs. Each rollout produces 8,192 outputs, split into 16 minibatches of 512 for one inner epoch, and the reference model is replaced every 400 steps.
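The group-relative advantage is what lets GRPO drop the value model: each output is scored against the other outputs sampled for the same question. The sketch below normalizes rewards by the group mean and standard deviation. The use of the population standard deviation and the small eps guard against zero variance are assumptions of this example; the clipped policy-ratio objective and KL penalty are omitted.

```python
from statistics import mean, pstdev

GROUP_SIZE = 16          # outputs sampled per question (from the card)
QUESTIONS_PER_STEP = 32  # unique questions per step (from the card)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One hypothetical group: 4 of 16 sampled outputs earned the rule-based reward.
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_relative_advantages(rewards)

print("outputs per step:", QUESTIONS_PER_STEP * GROUP_SIZE)
print("advantage for a correct output:", round(advantages[0], 3))
print("advantage for an incorrect output:", round(advantages[-1], 3))
```

A group in which every output receives the same reward yields zero advantage for all of them, so such questions contribute no gradient signal in this formulation.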

DeepSeek-R1 adapts this raw reasoning policy for readability and product use. The pipeline adds thousands of cold-start long-CoT examples, a first RL stage with a language-consistency reward, rejection sampling from the first-stage checkpoint, about 600K reasoning SFT samples, about 200K non-reasoning SFT samples, and a second RL stage mixing rule-based rewards, preference rewards, and language rewards. The second RL stage lowers rollout temperature to 0.7 because higher temperature caused incoherent generation, and it limits general preference-reward training to the final 400 of 1,700 steps to reduce reward hacking.
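The late introduction of preference rewards amounts to a step-dependent gate. The following is a minimal sketch of such a gate; the 0-indexed step convention and the function name are assumptions, and how the gated reward is combined with the rule-based and language rewards is not modeled here.

```python
TOTAL_STEPS = 1_700       # second-stage RL steps (from the card)
PREFERENCE_STEPS = 400    # final steps that include preference rewards (from the card)


def preference_reward_active(step: int) -> bool:
    """True during the last PREFERENCE_STEPS steps.
    Assumes steps are numbered 0 .. TOTAL_STEPS - 1."""
    if not 0 <= step < TOTAL_STEPS:
        raise ValueError(f"step {step} is outside the training run")
    return step >= TOTAL_STEPS - PREFERENCE_STEPS


first_gated_step = next(s for s in range(TOTAL_STEPS) if preference_reward_active(s))
active_steps = sum(preference_reward_active(s) for s in range(TOTAL_STEPS))
print(first_gated_step, active_steps)
```

Keeping the model-based reward to a short final window limits how long the policy can optimize against the reward model's weaknesses.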

Evidence

The training curve is the central evidence. DeepSeek-R1-Zero rises from 15.6% to 77.9% pass@1 on AIME 2024 during RL, and self-consistency decoding raises it to 86.7%. The paper reports that thinking time increases throughout training and that reflective behaviors, including "wait", become more common; Appendix C.2 says representative reflective words rise 5- to 7-fold.
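The two headline numbers measure different things: pass@1 averages correctness over independently sampled answers, while self-consistency takes a majority vote over the samples and scores only the winner. A minimal sketch of both follows, with made-up sample answers and hypothetical function names; the number of samples per question used in the paper is not reproduced here.

```python
from collections import Counter


def pass_at_1(correct_flags):
    """Mean correctness over the sampled answers for one question."""
    return sum(correct_flags) / len(correct_flags)


def majority_vote(answers):
    """Most frequent final answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled responses.
samples = ["42", "41", "42", "42", "17"]
reference = "42"

p1 = pass_at_1([answer == reference for answer in samples])
voted = majority_vote(samples)

print("pass@1 estimate:", p1)
print("majority-vote answer correct:", voted == reference)
```

In this toy case only three of five samples are correct, yet the vote recovers the right answer, which is why self-consistency can exceed pass@1.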

Table 3 shows the final R1 balancing reasoning with instruction following: AIME 2024 pass@1 reaches 79.8, MATH-500 97.3, LiveCodeBench 65.9, Codeforces 96.3 percentile with rating 2029, SWE-Bench Verified 49.2, IF-Eval 83.3, AlpacaEval 2.0 87.6, and ArenaHard 92.3. The same table also shows why the staged pipeline matters: R1-Zero is strong at math but weak on instruction following, while final R1 improves general user-preference benchmarks after mixed SFT and RL.

The evaluation setup is also compute-shaped. DeepSeek-R1 outputs are capped at 32,768 tokens on every benchmark. The paper decontaminates pretraining and post-training data with n-gram filtering, removes about six million potential math pretraining texts, and sources mathematical SFT/RL prompts from pre-2023 competitions.
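N-gram decontamination can be sketched as a set-overlap test: a training text is dropped if it shares any n-gram with a benchmark item. The example below is an illustrative sketch only. Lowercased whitespace tokenization, the default n, and the function names are assumptions, since the card does not record the paper's exact settings.

```python
def ngrams(text: str, n: int) -> set:
    """Set of n-token windows, using lowercased whitespace tokens (an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(candidate: str, benchmark_items, n: int = 10) -> bool:
    """True if the candidate shares at least one n-gram with any benchmark item."""
    benchmark_ngrams = set()
    for item in benchmark_items:
        benchmark_ngrams |= ngrams(item, n)
    return not ngrams(candidate, n).isdisjoint(benchmark_ngrams)


# Made-up benchmark question and training texts for illustration.
benchmark = ["find the sum of all positive integers less than ten"]
leaked = "please find the sum of all positive integers now"
clean = "an unrelated sentence about cooking pasta at home tonight"

print(is_contaminated(leaked, benchmark, n=5))
print(is_contaminated(clean, benchmark, n=5))
```

Exact n-gram matching misses paraphrased or translated benchmark items, so a filter like this gives a lower bound on contamination rather than a guarantee.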

Historical Effect

DeepSeek-R1 made long-CoT RL a public, open-weight scaling axis. Earlier inference-time reasoning methods mostly spent more decoding samples or wrapped frozen models in search. R1 showed that a pretrained MoE model could be post-trained so that the policy itself dynamically spends more tokens on hard problems and learns reflection and self-verification behavior.

Historically, the paper also made the systems cost of reasoning models visible: long outputs, verifier latency, reward design, expert-parallel rollout, model offload, and GPU-hour budgets are part of the method. GRPO became important because it matched the memory and rollout structure of H800-scale long-CoT RL better than a value-model-heavy PPO pipeline.

Limits

The paper is explicit about remaining bottlenecks. DeepSeek-R1 has weaker structured output and tool use than some existing models, and it cannot use search engines or calculators. It still overthinks simple questions, so token efficiency remains open. It is optimized mainly for Chinese and English and may mix languages elsewhere. It is sensitive to prompts: few-shot prompting hurt performance in the authors' evaluation. Software engineering RL is limited because evaluations take too long, so large-scale RL was not extensively applied there. Finally, pure RL depends on reliable rewards; rule-based rewards work for verifiable tasks, but model-based rewards can be hacked.