Kimi K2.5: Visual Agentic Intelligence

Metadata

Reading status: read complete
Year: 2026
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2026-kimi_k2_5_2026.pdf
Extracted text: 2026-kimi_k2_5_2026.txt
PDF URL: https://arxiv.org/pdf/2602.02276.pdf
OpenAlex:
Citation count source/date: Frontier provisional 2026-06-15
Citation count:
Reading card created: 2026-06-15

Compute Setup

Kimi K2.5 is explicit about its training hardware. The appendix says it is trained on NVIDIA H800 GPU clusters with 8x400 Gbps RoCE interconnects across nodes. The parallel strategy combines 16-way pipeline parallelism with virtual stages, 16-way expert parallelism, and ZeRO-1 data parallelism, with expert all-to-all communication overlapped with computation. To fit activations in memory, the system uses selective recomputation, FP8-E4M3 activation compression, and CPU offload with overlapped streaming.

The model inherits Kimi K2's trillion-parameter MoE base: 1.04T total parameters, 32B activated parameters, 384 experts, and 8 experts activated per token. K2.5 adds a 1T-token ViT stage, 15T tokens of vision-text joint pre-training at 4K sequence length, and long-context mid-training that extends sequence length from 32K to 262K. Evaluations use 256K context. The RL infrastructure is also part of the setup: agent tasks run as asynchronous coroutines, and a Rollout Manager orchestrates up to 100,000 concurrent tasks with sandboxed tools and environments.
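
As a rough illustration of the coroutine-per-task pattern, here is a minimal sketch using Python's asyncio. The names run_agent_task and rollout_manager, the semaphore-based concurrency cap, and the dummy reward are all assumptions for illustration; the paper's Rollout Manager is not described at this level of detail here.

```python
import asyncio


async def run_agent_task(task_id, limiter):
    """Hypothetical agent rollout: one coroutine per task."""
    async with limiter:
        # Stand-in for a sandboxed tool call or environment step.
        await asyncio.sleep(0)
        # Dummy reward, only so the example returns something checkable.
        return {"task_id": task_id, "reward": float(task_id % 2)}


async def rollout_manager(num_tasks, max_concurrent):
    """Run all tasks concurrently, capping how many are in flight at once."""
    limiter = asyncio.Semaphore(max_concurrent)
    coroutines = [run_agent_task(i, limiter) for i in range(num_tasks)]
    # gather() returns results in submission order.
    return await asyncio.gather(*coroutines)


results = asyncio.run(rollout_manager(num_tasks=1000, max_concurrent=100))
```

The cap is what keeps a slow tool call from stalling the rest of the batch while still bounding sandbox usage; a real system would replace the sleep with actual environment I/O.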

Bottleneck

The paper targets two linked bottlenecks. First, multimodal pre-training has a fixed token budget: adding many visual tokens late can disrupt text competence without producing better vision. Second, agentic execution is sequential by default. Long tool-call chains make latency grow roughly linearly, consume context, and exhaust practical step budgets. Visual agent tasks add another cold-start problem: a pretrained VLM does not automatically know how to use screenshots, IPython, browser tools, or pixel-level operations inside an agent loop.

The systems bottleneck is visible in the infrastructure sections. Multimodal input size varies by image count and resolution, so placing the vision encoder in the first pipeline stage causes load and memory imbalance. In RL, long-horizon tool tasks create sparse delayed rewards and train-inference mismatch. For parallel agents, naive concurrency can become reward hacking: spawning many subagents without meaningful decomposition increases work but not wall-clock progress.

Method Adaptation

K2.5 adapts training to the H800 cluster by making multimodality and parallelism first-class. For vision-language training, the winning ablation is early fusion with a low visual ratio under the same total token budget. MoonViT-3D extends native-resolution image processing to video by packing up to four consecutive frames into a shared spatiotemporal representation and then applying lightweight temporal pooling, yielding 4x temporal compression without a separate video module. Its technical value is shared image/video representation, not compression alone.
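
The 4x temporal compression can be pictured with a small sketch. It assumes mean pooling over groups of four consecutive frames and a plain nested-list layout; the card only says "lightweight temporal pooling", so the pooling operator and the function name temporal_pool are illustrative guesses, and the sketch covers only the pooling stage, not the frame packing inside the encoder.

```python
def temporal_pool(frames, group_size=4):
    """Average consecutive frames in groups of `group_size`.

    `frames` has shape [T][P][D]: T frames, P patches per frame,
    D feature dimensions. Mean pooling is an assumption.
    """
    pooled = []
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        num_patches = len(group[0])
        dim = len(group[0][0])
        pooled.append([
            [sum(f[p][d] for f in group) / len(group) for d in range(dim)]
            for p in range(num_patches)
        ])
    return pooled


# 8 frames, 2 patches each, 3 feature dims; frame t is filled with value t.
frames = [[[float(t)] * 3 for _ in range(2)] for t in range(8)]
pooled = temporal_pool(frames)  # 8 frames -> 2 pooled frames
```

Because the patch grid is unchanged, a single image is just the T = 1 case, which is the sense in which image and video share one representation.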

For post-training, zero-vision SFT uses text-only SFT to activate visual tool use, with image manipulation proxied through IPython operations. Outcome-based visual RL then trains grounding, counting, chart/document, and vision-critical STEM tasks. Joint multimodal RL organizes domains by ability rather than modality, so knowledge, reasoning, coding, and agent tasks learn from both text and visual traces. Token-efficient RL uses Toggle, alternating budget-limited phases with full inference-time scaling to reduce output tokens without removing high-compute behavior.
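
One way to picture the Toggle alternation is the sketch below. The phase length, the token budget, and the rule that an over-budget answer earns zero reward in the budget-limited phase are all assumptions made for illustration; the card does not give the actual schedule or penalty.

```python
def toggle_phase(iteration, period=2):
    """Alternate between a budget-limited phase and a full-scaling phase.

    `period` (iterations per phase) is a hypothetical parameter.
    """
    return "budget" if (iteration // period) % 2 == 0 else "scaling"


def shaped_reward(correct, num_tokens, phase, budget=1000):
    """Outcome reward, zeroed in the budget phase if the output is too long.

    The zeroing rule and the default budget are assumptions.
    """
    reward = 1.0 if correct else 0.0
    if phase == "budget" and num_tokens > budget:
        return 0.0
    return reward


schedule = [toggle_phase(i) for i in range(6)]
```

The point of alternating rather than always penalizing length is that the scaling phase keeps long, high-compute reasoning rewarded when it is needed.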

Agent Swarm adapts inference-time compute to parallel hardware. PARL trains a central orchestrator while freezing subagents from fixed intermediate checkpoints. The orchestrator learns when to create subagents and assign work; subagent trajectories are observations rather than differentiable targets. Rewards combine task performance, an instantiation reward to avoid serial collapse, and a finish reward to avoid spurious parallelism, with auxiliary weights annealed away. "Critical steps" measure the wall-clock path through parallel groups, so the learned policy is rewarded for shortening latency rather than merely increasing total tool calls.
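
The difference between critical steps and total steps can be made concrete with a short sketch. It assumes a trajectory is a list of stages, each with some orchestrator steps and a group of subagents running in parallel, so that a stage costs the orchestrator steps plus the longest subagent. That reading, the linear annealing, and the auxiliary weights w_inst and w_fin are assumptions, not the paper's exact definitions.

```python
def critical_steps(stages):
    """Wall-clock path: per stage, orchestrator steps plus the slowest subagent.

    `stages` is a list of (orchestrator_steps, [subagent_steps, ...]).
    """
    total = 0
    for main_steps, sub_steps in stages:
        total += main_steps + (max(sub_steps) if sub_steps else 0)
    return total


def total_steps(stages):
    """All steps executed, ignoring parallelism."""
    return sum(main + sum(subs) for main, subs in stages)


def parl_reward(task, instantiation, finish, progress, w_inst=0.1, w_fin=0.1):
    """Task reward plus auxiliary terms that fade as training progresses.

    `progress` runs from 0.0 (start) to 1.0 (end); linear decay is assumed.
    """
    anneal = max(0.0, 1.0 - progress)
    return task + anneal * (w_inst * instantiation + w_fin * finish)


stages = [(1, [5, 3, 4]), (1, []), (1, [2, 6])]
crit = critical_steps(stages)  # 6 + 1 + 7
work = total_steps(stages)     # 13 + 1 + 9
```

Spawning more subagents raises total_steps but lowers critical_steps only if the longest branch in a stage gets shorter, which is why the metric discourages spurious parallelism.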

The Decoupled Encoder Process fixes multimodal pipeline imbalance by running the vision forward pass across all GPUs, discarding intermediate activations, training the backbone with the text-optimized strategy, then recomputing vision for backward gradients. The authors report about 90% multimodal training efficiency relative to text-only training.
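
The three stages can be sketched on a scalar toy model. Here the "vision encoder" is tanh(w * x) and the "backbone" is a squared-error loss on u * h; both are stand-ins chosen so the gradients can be written by hand, and none of the function names come from the paper.

```python
import math


def vision_forward(w, x):
    """Toy vision encoder: a single tanh unit."""
    return math.tanh(w * x)


def backbone_loss_and_grads(u, h, y):
    """Toy backbone: squared error on u * h, with gradients for u and h."""
    pred = u * h
    loss = (pred - y) ** 2
    grad_u = 2 * (pred - y) * h
    grad_h = 2 * (pred - y) * u
    return loss, grad_u, grad_h


def dep_step(w, u, x, y):
    # Stage 1: vision forward pass; keep only the output feature,
    # not the encoder's intermediate activations.
    h = vision_forward(w, x)
    # Stage 2: backbone forward and backward; this yields the gradient
    # with respect to the vision feature h.
    loss, grad_u, grad_h = backbone_loss_and_grads(u, h, y)
    # Stage 3: recompute the vision forward pass, then backpropagate
    # grad_h through it to get the encoder gradient.
    h_recomputed = vision_forward(w, x)
    grad_w = grad_h * (1.0 - h_recomputed ** 2) * x
    return loss, grad_w, grad_u


loss, grad_w, grad_u = dep_step(w=0.5, u=2.0, x=1.5, y=1.0)
```

The extra vision forward pass is the price paid for not holding encoder activations while the backbone trains, which is consistent with the reported efficiency being below 100% of text-only training.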

Evidence

The multimodal evidence starts with ablations. Under a fixed vision-text token budget, early low-ratio fusion beats mid and late high-ratio fusion across vision knowledge, visual reasoning, OCR, text knowledge, text reasoning, and code. Vision RL also improves text-only benchmarks: MMLU-Pro rises from 84.7 to 86.4, GPQA-Diamond from 84.3 to 86.4, and LongBench v2 from 56.7 to 58.9.

The agentic evidence is stronger than a benchmark table alone because it reports runtime. In wide-search scenarios, Agent Swarm improves item-level F1 from 72.8% to 79.0% and reduces latency by up to 4.5x versus a single-agent baseline. Table 6 reports BrowseComp at 78.4 for Agent Swarm versus 60.6 for single-agent K2.5, WideSearch at 79.0 versus 72.7, and an in-house Swarm Bench at 58.3 versus 41.6. The paper also frames Agent Swarm as proactive context management: subagents keep bounded local memories and return selected results, so the orchestrator avoids global context pollution.
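
For quick comparison, the absolute gains implied by the Table 6 figures quoted above can be computed directly. The dictionary below only restates those numbers; the variable names are illustrative.

```python
# (single-agent K2.5, Agent Swarm) scores as quoted from Table 6.
table6 = {
    "BrowseComp": (60.6, 78.4),
    "WideSearch": (72.7, 79.0),
    "Swarm Bench": (41.6, 58.3),
}

# Absolute gain in points, rounded to one decimal place.
gains = {
    name: round(swarm - single, 1)
    for name, (single, swarm) in table6.items()
}
```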

The broader evaluation shows these compute choices holding up across reasoning, coding, agentic, and vision benchmarks: 96.1 on AIME 2025, 85.0 on LiveCodeBench v6, 76.8 on SWE-Bench Verified, 63.3 on OSWorld-Verified, 92.3 on OCRBench, 79.8 on LongVideoBench with more than 2,000 frames, and 75.9 on LVBench. In the qualitative appendix, the Black Myth: Wukong stress test splits 24 hours of gameplay across 32 videos and 40GB of media among parallel visual agents, then synthesizes an HTML timeline and showcase.

Historical Effect

Kimi K2.5 marks a shift from "one model thinks longer" to "one model orchestrates more parallel work." It treats agentic latency, visual context, and tool execution as compute-allocation problems. Instead of only increasing token budgets, it adds a trainable orchestration layer that can shard context and wall-clock work across frozen agents. The method's value lies in learned task decomposition with bounded local memories, plus multimodal RL that transfers across text and vision. For the repo, K2.5 is a clear example of inference-time compute becoming distributed, multimodal, and latency-aware.

Limits

The result depends on substantial infrastructure: H800 clusters, high-bandwidth RoCE, MoE parallelism, large asynchronous RL serving, sandboxed tools, custom inference APIs, and subagent interfaces. Agent Swarm is evaluated with explicit create_subagent and assign_task tools, so it is not just a prompt that any deployment can reproduce. Some results are internal, and context-overflow protocols vary by benchmark. The report also notes that human-designed visual trajectories can hurt generalization at SFT time. Terminal Bench 2.0 is evaluated in non-thinking mode because thinking-mode context management is incompatible with the Terminus setting.