DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Download PDF

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Metadata

Compute Setup

DeepSeek-V3.2 is a 128K-context continuation of DeepSeek-V3.1-Terminus whose main architectural change is DeepSeek Sparse Attention (DSA). The paper explicitly reports H800 GPUs for service benchmarking: inference costs are estimated from the actual DeepSeek-V3.2 service deployed on H800 clusters at a rental price of 2 USD per GPU hour. It does not name the full training cluster. Under the project rule, the training-time device should therefore be interpreted from the frontier accelerator setup about one year before publication: Hopper/H200-class or local H800/H100-style accelerator clusters with high HBM bandwidth, constrained KV-cache memory, and distributed MoE routing. The paper's own H800 serving measurements anchor this interpretation more tightly than a generic 2025 accelerator guess.

The training recipe is compute-explicit even when the hardware is not. DSA is introduced by continued pre-training from a 128K checkpoint. The dense warm-up trains only the lightning indexer for 1,000 steps, with 16 sequences of 128K tokens per step, totaling 2.1B tokens. The sparse stage then trains the main model and indexer for 15,000 steps, with 480 sequences of 128K tokens per step, totaling 943.7B tokens, while each query attends to 2,048 selected key-value tokens. Post-training keeps sparse attention and combines specialist distillation with mixed RL. The authors emphasize that their post-training compute already exceeds 10% of the pre-training cost.

Bottleneck

The bottleneck is not simply parameter count. The paper identifies three constraints on open models: dense attention makes long sequences expensive for both deployment and post-training; open models have invested too little compute in RL and specialist post-training; and agentic models lag in generalization and instruction following when tool trajectories become long and heterogeneous. In practical serving terms, a 128K context turns every additional tool call into a KV-cache and attention-cost problem. In RL terms, large rollout batches create off-policy drift between the optimized inference engine and the training framework, especially for MoE routing and truncated sampling. In agent terms, search and MCP-style tasks can exceed the context window through redundant self-verification before the model reaches a final answer.

Method Adaptation

DSA adapts attention to the accelerator memory problem by separating cheap selection from expensive attention. A small lightning indexer scores previous tokens, then the main attention attends only to the top-k selected key-value entries. It is instantiated under MLA in MQA mode so a key-value entry can be shared across many query heads, improving kernel efficiency. Its technical value is that sparsity is learned against dense-attention behavior and then continued-trained into the model, so long-range retrieval remains content-dependent rather than a fixed window or hand-coded pattern.

The RL stack is likewise a systems adaptation. GRPO is stabilized with an unbiased KL estimate, off-policy sequence masking, Keep Routing for MoE consistency between rollout and training, and Keep Sampling Mask so top-p/top-k truncation defines the same action subspace in both policies. Agent data is scaled with real search APIs, coding tools, Jupyter notebooks, and synthesized environments. For tool-use thinking, DeepSeek retains reasoning traces across tool messages but drops them when a new user message arrives, preserving useful intermediate computation without unbounded context growth. For search overflow, it tests serial context-management methods such as Summary, Discard-75%, and Discard-all, plus a parallel-fewest-step baseline.

Evidence

The DSA evidence is a parity-and-cost claim: DeepSeek-V3.2-Exp preserves short- and long-context benchmark performance relative to DeepSeek-V3.1-Terminus while reducing long-context service cost on H800 clusters. The post-training evidence is broader. DeepSeek-V3.2-Thinking reaches 93.1 on AIME 2025, 92.5 on HMMT Feb 2025, 90.2 on HMMT Nov 2025, 83.3 on LiveCodeBench, 73.1 on SWE Verified, 70.2 on SWE Multilingual, and 46.4 on Terminal Bench 2.0. BrowseComp improves from 51.4 without context management to 67.6 with it, showing that test-time context policy is part of the compute setup, not a cosmetic evaluation choice.

The agent-training data is large and structured: 24,667 code-agent tasks, 50,275 search-agent tasks, 4,417 general-agent tasks, and 5,908 code-interpreter tasks. The general-agent pipeline retains 1,827 synthesized environments and 4,417 tasks with non-zero pass@100. The ablation on synthesized general tasks shows they are difficult even for frontier models: DeepSeek-V3.2-Exp scores 12% pass@1 on a sample of 50 tasks, while the best closed model in the table reaches 62%. Training on synthetic general-agent data improves Tau2Bench, MCP-Mark, and MCP-Universe, whereas code/search-only RL does not transfer as well.

DeepSeek-V3.2-Speciale is evidence for the high-compute frontier of the same recipe. It relaxes length constraints, uses reasoning-only RL, and reaches gold-medal-level performance on IMO 2025, CMO 2025, IOI 2025, and ICPC WF 2025, but at much higher token cost.

Historical Effect

DeepSeek-V3.2 is historically useful because it joins three compute adaptations that are often treated separately: sparse long-context attention, high-budget RL, and agentic environment synthesis. The paper frames open-model competitiveness as a resource-allocation problem: efficient attention makes long contexts and RL rollouts affordable; stable RL makes post-training compute scale; synthesized environments turn agent behavior into verifiable training data. Technically, it contributes a learned sparse-attention transition, MoE-aware GRPO stabilization, and a reproducible agent-data recipe, not just cheaper serving. It also makes context management a measurable form of inference-time compute.

Limits

The paper is candid that DeepSeek-V3.2 still trails the strongest closed models in world knowledge and complex task solving, and it attributes part of the gap to fewer total training FLOPs. Token efficiency is also unresolved: the official model uses stricter length constraints to control deployment cost and latency, while Speciale improves reasoning by spending more tokens. The 128K window remains a hard constraint for agentic search; more than 20% of some search cases exceed it without context management. Some tool frameworks simulate tool outputs as user messages, which reduces the benefit of DeepSeek's reasoning-retention policy. Finally, the training hardware is not fully specified, so the compute interpretation relies on the paper's H800 serving benchmark plus the repo's frontier-accelerator rule.

Links