Kimi k1.5: Scaling Reinforcement Learning with LLMs

Metadata

Reading status: read complete
Year: 2025
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2025-kimi_k1_5_2025.pdf
Extracted text: 2025-kimi_k1_5_2025.txt
PDF URL: https://arxiv.org/pdf/2501.12599.pdf
OpenAlex:
Citation count source/date: Frontier provisional 2026-06-15
Citation count:
Reading card created: 2026-06-15

Compute Setup

The extracted Kimi k1.5 report does not name a specific GPU model, device count, wall-clock training time, or total FLOPs. It does, however, describe a GPU-based long-context RL system: Megatron and vLLM run in Kubernetes sidecar containers that share the available GPUs; GPU memory is offloaded between training and inference phases; CUDA graphs, NCCL buffers, and NVIDIA drivers make that offloading difficult in practice; and checkpoints move over RDMA with Mooncake. Under the project rule, the exact hardware is therefore inferred from the local 2024 frontier accelerator context as H100/H200-class datacenter GPU infrastructure, not as a paper-stated device claim.

The paper's stated compute setup is the software and serving topology. Kimi k1.5 trains through pretraining, vanilla SFT, long-CoT SFT, and RL. Pretraining ends with a long-context activation stage, and vanilla SFT then trains one epoch at 32K sequence length followed by one epoch at 128K, packing multiple examples into each sequence. The RL system alternates rollout and training phases and is built from rollout workers, trainer workers, reward-model evaluation, a replay buffer, and a code execution service. The final run scales to 128K context, making output tokens and KV/context memory the central compute currency.
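The packing step mentioned above can be illustrated with a small sketch. The card does not say which packing algorithm the report uses, so greedy first-fit-decreasing is an assumption here, and `pack_examples` is a hypothetical helper name.

```python
from typing import List


def pack_examples(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group example indices into sequences of at most max_len tokens.

    Assumption: greedy first-fit-decreasing; the report's actual packing
    strategy is not stated in this card.
    """
    # Visit the longest examples first so short ones can fill leftover room.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []   # example indices per packed sequence
    free: List[int] = []         # remaining token room per packed sequence
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"example {i} ({n} tokens) exceeds max_len={max_len}")
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing sequence has room: open a new one.
            bins.append([i])
            free.append(max_len - n)
    return bins


if __name__ == "__main__":
    # Five examples packed into 32K-token sequences.
    print(pack_examples([20000, 12000, 30000, 2000, 700], 32768))
```

A real pipeline would also need attention masking or position resets so packed examples do not attend to each other; that detail is outside this sketch.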

Bottleneck

The paper frames the bottleneck as continued scaling after next-token pretraining runs into finite high-quality data. Instead of only scaling model size and pretraining tokens, k1.5 scales the amount of reasoning a model can perform during training and inference. Long-CoT makes language tokens behave like the search budget in a planning algorithm: more context permits more implicit search steps, backtracking, reflection, and correction.

That shift creates systems bottlenecks. Very long rollouts can monopolize rollout workers, make iteration time depend on the slowest sample, and leave GPUs idle when training and inference workers sit on separate resources. Reward computation is also hard: coding rewards need execution, math answers may be equivalent under different symbolic forms, and multimodal problems require visual reasoning. Overthinking becomes a token-efficiency problem because longer chains improve accuracy but cost more at training and serving time.

Method Adaptation

Kimi adapts RL to the long-context regime by flattening planning into the model context instead of running explicit tree search. The report says strong performance can be achieved without Monte Carlo tree search, value functions, or process reward models. Policy optimization is a variant of online policy mirror descent: each iteration uses the current model as the reference policy, samples k responses per problem, compares each reward against the mean reward of those samples as a baseline, and applies a regularized gradient. The system intentionally excludes a value network, both for efficiency and to avoid penalizing exploratory wrong turns that later recover to a correct answer.
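A minimal numeric sketch of the update described above, for one problem with k sampled responses. The mean-reward baseline follows the card; the squared log-ratio regularizer with weight `tau` is an assumed form of the "regularized gradient", and both function names are hypothetical. In a real trainer the log-probabilities would be autodiff tensors and this scalar would be the loss to minimize.

```python
from typing import List


def mean_baseline_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each sampled response relative to the sample mean.

    No value network is involved: the baseline is just the average reward
    of the k responses drawn for the same problem.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def surrogate_loss(
    rewards: List[float],
    logp_new: List[float],   # log pi_theta(y_j | x), sequence-level
    logp_ref: List[float],   # log pi_ref(y_j | x), reference = current iterate
    tau: float,
) -> float:
    """Scalar loss whose gradient pushes up above-average responses.

    Assumption: the regularizer is (tau / 2) * (log-ratio)^2, which keeps the
    new policy close to the reference policy that produced the samples.
    """
    advantages = mean_baseline_advantages(rewards)
    total = 0.0
    for adv, lp, lr in zip(advantages, logp_new, logp_ref):
        log_ratio = lp - lr
        total += -(adv * lp) + 0.5 * tau * log_ratio ** 2
    return total / len(rewards)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]           # e.g. verified correct / incorrect
    print(mean_baseline_advantages(rewards))  # below-average samples go negative
```

Incorrect responses receive negative advantages, so they are actively pushed down instead of merely being filtered out.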

Several mechanisms are direct adaptations to the token and GPU bottlenecks. A length reward restrains overthinking. Curriculum sampling starts on easier tasks before focusing on hard ones, while prioritized sampling raises the probability of problems with low observed success. Partial rollouts impose a fixed output-token budget per rollout; unfinished trajectories are stored in a replay buffer and resumed in later iterations, so only the newest segment is on-policy and earlier segments are reused. Hybrid deployment cycles one set of GPUs through both roles: Megatron trains and offloads its GPU memory, weights are transferred to vLLM through Mooncake, vLLM runs the rollouts and is then terminated, and Megatron reloads its training memory. For code RL, the sandbox uses Kubernetes, crun, pre-created cgroups, and tmpfs-backed overlay storage to reduce executor latency.
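The partial-rollout bookkeeping can be sketched as a toy simulation. Token generation is faked with a known `target_len` per sample, and `Trajectory` and `run_iteration` are hypothetical names; the sketch shows only the budget cap and the replay buffer, not the report's actual implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class Trajectory:
    prompt_id: str
    target_len: int  # hypothetical: tokens the sample needs before it stops
    segments: List[int] = field(default_factory=list)  # tokens per iteration

    @property
    def generated(self) -> int:
        return sum(self.segments)

    @property
    def done(self) -> bool:
        return self.generated >= self.target_len


def run_iteration(
    new_prompts: Dict[str, int],
    replay: Deque[Trajectory],
    budget: int,
) -> List[Trajectory]:
    """Run one rollout phase with a fixed output-token budget per trajectory.

    Unfinished trajectories go back into the replay buffer and are resumed
    next iteration; only the segment appended here comes from the current policy.
    """
    # Resume earlier unfinished work first, then start the new prompts.
    active = list(replay) + [Trajectory(p, n) for p, n in new_prompts.items()]
    replay.clear()
    finished: List[Trajectory] = []
    for traj in active:
        step = min(budget, traj.target_len - traj.generated)
        traj.segments.append(step)
        if traj.done:
            finished.append(traj)
        else:
            replay.append(traj)
    return finished


if __name__ == "__main__":
    buffer: Deque[Trajectory] = deque()
    first = run_iteration({"short": 300, "long": 2500}, buffer, budget=1000)
    print([t.prompt_id for t in first], len(buffer))
```

Because no trajectory can exceed the budget in a single iteration, iteration time is bounded by the budget instead of by the longest sample.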

The long2short pipeline adapts the same reasoning compute to lower-latency serving. It uses model merging, shortest rejection sampling, DPO, and a separate long2short RL phase with length penalty and shorter rollout limits to transfer long-CoT priors into short-CoT models.
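Two of the pieces named above lend themselves to a short sketch: shortest rejection sampling and a length-based reward. The linear form of the length reward (from +0.5 for the shortest sample down to -0.5 for the longest, with incorrect answers never rewarded for brevity) is an assumption and is not quoted from this card; the function names are hypothetical, and response length is measured in characters here where a real system would count tokens.

```python
from typing import List, Optional, Tuple

Sample = Tuple[str, bool]  # (response text, verified correct?)


def shortest_correct(samples: List[Sample]) -> Optional[str]:
    """Shortest rejection sampling: keep the shortest response that is correct."""
    correct = [text for text, ok in samples if ok]
    return min(correct, key=len) if correct else None


def length_rewards(lengths: List[int], correct: List[bool]) -> List[float]:
    """Length reward over the responses sampled for one problem.

    Assumed form: lam = 0.5 - (len - min_len) / (max_len - min_len).
    Correct responses receive lam; incorrect ones receive min(0, lam), so a
    wrong answer is never rewarded for being short.
    """
    lo, hi = min(lengths), max(lengths)
    if hi == lo:
        return [0.0] * len(lengths)  # all the same length: no length signal
    rewards = []
    for n, ok in zip(lengths, correct):
        lam = 0.5 - (n - lo) / (hi - lo)
        rewards.append(lam if ok else min(0.0, lam))
    return rewards


if __name__ == "__main__":
    samples = [("a long, winding derivation ... 42", True), ("42", True), ("7", False)]
    print(shortest_correct(samples))
    print(length_rewards([100, 200, 300], [True, True, False]))
```

The selected short responses could then serve as SFT targets or as the preferred side of DPO pairs, while the length reward would be added to the correctness reward during the long2short RL phase.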

Evidence

The headline long-CoT model reaches 77.5 on AIME 2024, 96.2 on MATH-500, 94th percentile on Codeforces, 62.5 on LiveCodeBench, 74.9 on MathVista, 70.0 on MMMU, and 38.6 on MathVision. The short-CoT model reaches 60.8 on AIME 2024, 94.6 on MATH-500, 47.3 on LiveCodeBench, and 87.2 on IF-Eval. Figure 5 and Figure 6 support the scaling claim: response length and performance increase during RL, and the final run scales to 128K context with continued improvement on hard reasoning benchmarks.

The ablations reinforce the compute interpretation. The report says a smaller model can become comparable to a larger one by using longer RL-optimized CoTs, but the larger model remains more token efficient. It also reports that the proposed RL method has better sample complexity than ReST because negative gradients penalize incorrect responses. For reward computation, the Chain-of-Thought reward model reaches about 98.5% accuracy in manual spot checks, versus 84.4% for a classic value-head reward model. For code execution, the sandbox cuts container startup time from 0.12 seconds with Docker to 0.04 seconds (3x faster) and raises maximum container starts from 27/sec to 120/sec (about 4.4x) on a 16-core machine. For serving efficiency, long2short RL reaches 60.8 AIME pass@1 while using 3,272 tokens on average.

Historical Effect

Kimi k1.5 is a marker for the "reasoning as long-context RL" branch of 2025 systems work. Its historical point is not just benchmark parity with o1-like models; it shows how to make a long-CoT model trainable on a production-style GPU cluster by sharing devices between Megatron and vLLM, moving weights quickly, reusing partial trajectories, and making reward environments scalable.

The report also clarifies a methodological split. Rather than building explicit search trees at inference time, k1.5 trains the model to internalize planning inside a long autoregressive context. That made context length, rollout scheduling, reward latency, and token efficiency central method variables for reasoning models.

Limits

The main compute limitation is evidentiary: the paper does not disclose exact hardware, chip count, total training cost, model size, or full architecture details, so device-level cost cannot be reconstructed. The conclusion leaves the efficiency of long-context RL as an open question. Overthinking is still a live problem, and the authors explicitly point to future work on credit assignment and reducing overthinking without damaging exploration. Reward reliability is also fragile: the system needs verifiable prompts, math reward models, generated code tests, and sandboxed execution. Finally, the long2short results improve token efficiency but do not remove the need for expensive long-CoT teachers and RL iterations.