Kimi K2: Open Agentic Intelligence

Metadata

Reading status: read complete
Year: 2025
Compute regime: Sparse and memory-efficient scaling (sparse_memory_efficient_scaling)
PDF: 2025-kimi_k2_2025.pdf
Extracted text: 2025-kimi_k2_2025.txt
PDF URL: https://arxiv.org/pdf/2507.20534.pdf
OpenAlex:
Citation count source/date: Frontier provisional 2026-06-15
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly states the training hardware. Kimi K2 was trained on an NVIDIA H800 GPU cluster. Each node has 2 TB RAM and 8 H800 GPUs connected by NVLink and NVSwitch; nodes communicate through 8 x 400 Gbps RoCE links. The base model is a 1.04T-parameter MoE Transformer with 32.6B activated parameters, 61 layers, 384 routed experts, 8 active experts per token, one shared expert, Multi-head Latent Attention, and 64 attention heads.

The training system uses 16-way pipeline parallelism with virtual stages, 16-way expert parallelism, and ZeRO-1 data parallelism. BF16 parameters and FP32 gradient buffers require about 6 TB across a 256-GPU model-parallel group, with about 30 GB of model state per GPU. Pretraining processes 15.5T tokens at a 4,096-token context window with the MuonClip optimizer, a WSD (warmup-stable-decay) learning-rate schedule, weight decay 0.1, and a 67M-token global batch. An annealing phase adds 400B tokens at 4k sequence length, followed by 60B tokens at 32k, and YaRN extends the context to 128k.
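The memory figures above can be checked with back-of-the-envelope arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes), 2 bytes per BF16 parameter, and 4 bytes per FP32 gradient-buffer element; the variable names are illustrative and not taken from the paper. Parameters and gradient buffers alone come to roughly 24 GB per GPU, so the quoted figure of about 30 GB presumably also covers other state, such as optimizer state sharded by ZeRO-1.

```python
# Back-of-the-envelope check of model-state memory and optimizer step count.
# Assumptions (not from the paper): decimal units, no padding, no replication.
TOTAL_PARAMS = 1.04e12       # total parameters (1.04T)
BYTES_BF16 = 2               # bytes per BF16 parameter
BYTES_FP32 = 4               # bytes per FP32 gradient-buffer element
MODEL_PARALLEL_GPUS = 256    # consistent with 16-way PP x 16-way EP

param_and_grad_bytes = TOTAL_PARAMS * (BYTES_BF16 + BYTES_FP32)
total_tb = param_and_grad_bytes / 1e12
per_gpu_gb = param_and_grad_bytes / MODEL_PARALLEL_GPUS / 1e9

# Both token counts are rounded in the card, so the step count is approximate.
PRETRAIN_TOKENS = 15.5e12
GLOBAL_BATCH_TOKENS = 67e6
optimizer_steps = PRETRAIN_TOKENS / GLOBAL_BATCH_TOKENS

print(f"params + grad buffers: {total_tb:.2f} TB")
print(f"per GPU (params + grads only): {per_gpu_gb:.1f} GB")
print(f"approximate optimizer steps: {optimizer_steps:,.0f}")
```

The step count of a little over 230,000 is only an estimate, but it gives a sense of scale for the 70,000-step window during which QK-Clip is reported to be active.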

Bottleneck

Kimi K2 is organized around three bottlenecks: sparse trillion-parameter training, optimizer stability, and agentic post-training throughput. The design targets high total capacity but cannot afford dense activation of 1T parameters. MoE sparsity keeps per-token compute close to that of a model with about 32B activated parameters, but it creates expert-parallel communication and load-balancing problems on the H800 cluster.

The optimizer bottleneck is specific to scaling Muon. The paper reports that vanilla Muon can trigger exploding attention logits at mid scale: in a 9B-activated, 53B-total MoE run, maximum logits quickly exceed 1000, which risks loss spikes or divergence. Query-Key Normalization is not suitable for MLA because key matrices are not fully materialized during inference. A 15.5T-token run therefore needs a stability mechanism that does not erase Muon's token-efficiency advantage.

The post-training bottleneck is agentic interaction. Tool use and software engineering data are rare in natural text and expensive to generate from real environments. RL rollout for long-horizon tasks can leave GPUs waiting on VMs, code interpreters, sandboxes, or environment feedback. Weight updates between training and inference engines are also costly at 1T parameters.

Method Adaptation

K2 adapts its architecture to H800 memory and interconnect limits through ultra-sparse MoE and MLA. The sparsity scaling-law experiments show that, at fixed activated parameters and FLOPs, increasing the total number of experts lowers training and validation loss. K2 adopts a sparsity of 48 (total experts divided by activated experts) by activating 8 of 384 routed experts per token. It cuts attention heads from DeepSeek-V3's 128 to 64 because, at 128k sequence length, doubling heads from 64 to 128 would raise inference FLOPs by 83% while giving only modest validation-loss gains.
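The sparsity figure is simple arithmetic, sketched below. The function name is illustrative, and the DeepSeek-V3 expert counts (256 routed, 8 active) come from that model's own report rather than from this card.

```python
def sparsity(total_experts: int, active_experts: int) -> float:
    """Sparsity as used here: total routed experts per activated expert."""
    return total_experts / active_experts

# Kimi K2: 8 of 384 routed experts are active per token.
k2_sparsity = sparsity(384, 8)

# DeepSeek-V3, for comparison (counts taken from outside this card).
dsv3_sparsity = sparsity(256, 8)

# Fraction of routed experts that a single token touches in K2.
k2_active_fraction = 8 / 384

print(f"K2 sparsity: {k2_sparsity:.0f}")
print(f"DeepSeek-V3 sparsity: {dsv3_sparsity:.0f}")
print(f"K2 routed experts used per token: {k2_active_fraction:.2%}")
```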

MuonClip adapts the optimizer to large sparse training. Its technical value is preserving Muon's token-efficient update while adding a targeted guardrail against attention-logit explosion. QK-Clip monitors each head's maximum attention logit and rescales query/key projection weights after the optimizer update if the value exceeds threshold tau. For MLA it clips only unshared head components, leaving the shared rotary key component untouched; it is a post-update control on weight growth.
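A minimal sketch of the clipping step follows, under simplifying assumptions: a single head of ordinary multi-head attention (no MLA split into shared and unshared components), no causal mask, and a rescaling factor gamma = tau / S_max that is split evenly as sqrt(gamma) between the query and key weights. The helper names and toy numbers are hypothetical; this illustrates the idea and is not the paper's implementation.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def max_logit(Wq, Wk, xs, scale):
    """Largest scaled query-key dot product over all token pairs."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    return max(scale * sum(a * b for a, b in zip(q, k)) for q in qs for k in ks)

def qk_clip(Wq, Wk, s_max, tau):
    """Rescale one head's query/key weights if its max logit exceeds tau."""
    if s_max <= tau:
        return Wq, Wk, 1.0                 # below threshold: leave weights alone
    gamma = tau / s_max                    # shrink factor for the logits
    root = math.sqrt(gamma)                # split evenly between Wq and Wk
    Wq_new = [[w * root for w in row] for row in Wq]
    Wk_new = [[w * root for w in row] for row in Wk]
    return Wq_new, Wk_new, gamma

# Toy head with deliberately large weights (hypothetical values).
xs = [[1.0, 2.0], [3.0, -1.0]]             # two token embeddings
Wq = [[10.0, 0.0], [0.0, 10.0]]
Wk = [[10.0, 0.0], [0.0, 10.0]]
scale = 1.0 / math.sqrt(2)                 # 1 / sqrt(head_dim)
tau = 100.0

before = max_logit(Wq, Wk, xs, scale)
Wq, Wk, gamma = qk_clip(Wq, Wk, before, tau)
after = max_logit(Wq, Wk, xs, scale)
print(f"max logit before: {before:.1f}, after: {after:.1f}, gamma: {gamma:.4f}")
```

Because logits are bilinear in the query and key weights, scaling each by sqrt(gamma) scales every logit by gamma, which brings the head's maximum back to tau without touching heads that are already below the threshold.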

The systems stack adapts to memory pressure with selective recomputation, FP8 activation storage, and CPU offload. LayerNorm, SwiGLU, MLA up-projections, and MoE down-projections are recomputed in the backward pass. Inputs of MoE up-projections and SwiGLU are stored as FP8-E4M3 tiles with FP32 scales, while the remaining activations are offloaded to CPU RAM by copy engines, overlapped with compute and communication.
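The storage saving from tiled FP8 can be estimated with a short calculation. The sketch below assumes 128 elements per tile sharing one FP32 scale and a BF16 baseline; neither assumption is stated in this card, and the names are illustrative.

```python
# Rough per-element storage cost of tiled FP8 activations versus BF16.
# Assumptions: 128 elements share one FP32 scale; the baseline is BF16.
TILE_ELEMS = 128
FP8_BYTES = 1
FP32_SCALE_BYTES = 4
BF16_BYTES = 2

fp8_bytes_per_elem = FP8_BYTES + FP32_SCALE_BYTES / TILE_ELEMS
ratio_vs_bf16 = fp8_bytes_per_elem / BF16_BYTES

print(f"FP8 + scale: {fp8_bytes_per_elem:.5f} bytes per element")
print(f"relative to BF16: {ratio_vs_bf16:.1%}")
```

Under these assumptions the per-tile scale adds only about 3% overhead, so the stored tensors take roughly half the space of their BF16 counterparts.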

Post-training adapts the model to agentic workloads by synthesizing tools, agents, tasks, rubrics, and trajectories. The method's value is turning tool use into generated, filtered interactions instead of relying only on scarce human traces. The tool repository includes over 3000 real MCP tools plus over 20,000 synthetic tools. RL combines verifiable rewards with self-critique rubric rewards, budget control, PTX loss, and temperature decay.

Evidence

The central training evidence is stability at scale: K2 is pretrained on 15.5T tokens with zero loss spikes. In the full run with MuonClip and tau = 100, maximum logits are capped early and then decay into a stable range; the loss curve is reported as smooth throughout. Appendix evidence indicates that QK-Clip is only transiently active: during the first 70,000 steps, 12.7% of heads trigger it at least once, and after that every head's maximum logit falls below 100, so clipping becomes inactive.

Architecture evidence supports the sparse design. The paper reports that sparsity 48 can reach validation loss 1.5 with 1.69x, 1.39x, and 1.15x fewer FLOPs than sparsity 8, 16, and 32 respectively. K2 has more total parameters than DeepSeek-V3, 1.04T versus 671B, but fewer activated parameters, 32.6B versus 37B.
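The reported ratios can be restated as percentage savings and parameter fractions. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# "N x fewer FLOPs" restated as the fraction of FLOPs saved: 1 - 1/N.
flop_reduction = {8: 1.69, 16: 1.39, 32: 1.15}   # baseline sparsity -> factor
flops_saved = {s: 1 - 1 / f for s, f in flop_reduction.items()}

# Total and activated parameters, in billions.
k2_total, k2_active = 1040.0, 32.6
dsv3_total, dsv3_active = 671.0, 37.0

total_ratio = k2_total / dsv3_total              # K2 total vs DeepSeek-V3 total
k2_active_share = k2_active / k2_total           # share of weights used per token
dsv3_active_share = dsv3_active / dsv3_total

for s, saved in flops_saved.items():
    print(f"vs sparsity {s}: {saved:.1%} fewer FLOPs at validation loss 1.5")
print(f"total parameters, K2 / DeepSeek-V3: {total_ratio:.2f}x")
print(f"activated share: K2 {k2_active_share:.2%}, DeepSeek-V3 {dsv3_active_share:.2%}")
```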

Capability evidence focuses on non-thinking and agentic settings. Kimi-K2-Instruct scores 66.1 on Tau2-Bench, 76.5 on ACEBench English, 65.8 on SWE-bench Verified in the single-attempt agentic setting, 47.3 on SWE-bench Multilingual, 53.7 on LiveCodeBench v6, 27.1 on OJBench, 49.5 on AIME 2025, and 75.1 on GPQA-Diamond. SWE-bench Verified rises to 71.6 with multiple attempts. Long-context evaluation uses a 128k context; SWE-bench Verified Agentless uses a 16,384-token output limit.

Historical Effect

Kimi K2 is a historical marker for open-weight agentic MoE scaling. Its technical contribution is the combined recipe: ultra-sparse MoE capacity, MuonClip stability, synthetic tool-use data, and RL that mixes verifiable rewards with self-critique. The compute lesson is that agentic intelligence requires an economical 1T-parameter architecture, an optimizer that stays stable for 15.5T tokens, and infrastructure that keeps GPUs busy while environments and tools run.

It also extends the DeepSeek-V2/V3 line of memory-conscious MoE design into a post-training-heavy regime. MLA reduces cache pressure, high sparsity reduces activated FLOPs, fewer attention heads keep 128k inference affordable, and MuonClip protects training stability. The result is an open model whose strongest claims are software engineering and tool-use scores, not just language-model perplexity.

Limits

The paper is explicit about hardware class but not the full cluster size, total GPU-hours, power, or cost. Reproduction would require a large H800 cluster, custom parallelism, activation offload, checkpoint broadcasting, Kubernetes-backed sandboxes, and large-scale synthetic/real agentic environments.

The benchmark framing is also bounded. The report emphasizes non-thinking evaluations and does not claim an extended-reasoning mode. Some data points are omitted because evaluation costs were prohibitive. The limitations section notes that for ambiguous tasks or unclear tool definitions, the model may generate excessive tokens, which can lead to truncated outputs, and that one-shot complete-project prompting is weaker than using K2 inside an agentic framework. Agentic performance therefore depends on scaffolding, tool definitions, and rollout budget, not on the base checkpoint alone.