DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Metadata

  • Slug: deepseek_v2_2024
  • Year: 2024
  • Venue: arXiv
  • Authors: DeepSeek-AI
  • Reading status: read complete
  • Compute regime: Sparse and memory-efficient scaling (sparse_memory_efficient_scaling)
  • Primary sources: PDF, extracted text
  • Reading card created: 2026-06-15

Compute Setup

The paper explicitly states the device class and interconnect: experiments use an NVIDIA H800 cluster; each node has 8 GPUs connected by NVLink and NVSwitch, and nodes communicate through InfiniBand. Training uses the internal HAI-LLM framework with 16-way zero-bubble pipeline parallelism, 8-way expert parallelism, and ZeRO-1 data parallelism.

The model scale is 236B total parameters, 21B activated per token, 60 layers, hidden dimension 5120, 128 attention heads, and 128K context support. Pretraining uses 8.1T tokens. Batch size ramps from 2304 to 9216 sequences during the first 225B tokens, then stays at 9216, with maximum sequence length 4K. The routed experts of each layer are deployed across 8 devices, and each token is sent to at most 3 of those devices. Inference is reported on one 8-H800 node.
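The batch figures above translate into tokens per optimizer step with simple arithmetic. The sketch below assumes that "4K" means 4096 tokens and that every sequence is packed to full length, so the results are upper bounds rather than numbers reported by the paper; the constant and function names are illustrative.

```python
# Figures from the card; SEQ_LEN = 4096 is an assumed reading of "4K".
BATCH_START = 2304
BATCH_FINAL = 9216
SEQ_LEN = 4096
TOTAL_TOKENS = 8.1e12
RAMP_TOKENS = 225e9


def tokens_per_step(batch_size: int, seq_len: int = SEQ_LEN) -> int:
    """Upper bound on tokens per step, assuming fully packed sequences."""
    return batch_size * seq_len


def steps_after_ramp() -> float:
    """Approximate number of steps taken at the final batch size."""
    return (TOTAL_TOKENS - RAMP_TOKENS) / tokens_per_step(BATCH_FINAL)


print(tokens_per_step(BATCH_START))  # 9437184
print(tokens_per_step(BATCH_FINAL))  # 37748736
print(round(steps_after_ramp()))
```

Under these assumptions the final batch holds roughly 37.7M tokens, so most of pretraining runs in a little over two hundred thousand steps.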

Bottleneck

DeepSeek-V2 is built around three compute bottlenecks. The first is KV-cache memory during autoregressive inference. Standard multi-head attention caches keys and values for every token at every layer, which limits maximum batch size and sequence length. Long context makes this cost dominant: at 128K tokens the cache grows with every sequence in the batch, so it can exhaust memory even when the model weights fit.

The bound is particularly clear for DeepSeek-V2 because the paper reports the cache formula. FP8 storage for 236B total parameters is about 236 GB before runtime buffers, or about 29.5 GB per GPU on an 8-GPU serving node if evenly sharded. Standard MHA would cache 2 * n_h * d_h * l elements per token, where n_h is the number of heads, d_h the per-head dimension, and l the number of layers; MLA instead caches (d_c + d_h^R) * l elements per token. With 60 layers, d_c + d_h^R = 576, a 128K-token context, and 6-bit average KV storage, the MLA cache is about 3.4 GB at batch 1 and about 217 GB at batch 64 across the node. That makes cache compression and batch size first-order serving constraints, not secondary details.
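The cache figures above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: "128K" is read as 131072 tokens, GB means 10^9 bytes, and the per-head dimension of 128 used for the MHA comparison is an assumption not stated in this card. The function names are illustrative.

```python
N_LAYERS = 60
N_HEADS = 128
HEAD_DIM = 128          # assumed d_h; used only for the MHA comparison
D_C = 512               # compressed KV latent dimension
D_ROPE = 64             # decoupled RoPE key dimension
CONTEXT = 128 * 1024    # assumed reading of "128K"
KV_BITS = 6             # average bits per cached element


def mha_elements_per_token(n_heads: int, head_dim: int, n_layers: int) -> int:
    """Keys and values for every head at every layer."""
    return 2 * n_heads * head_dim * n_layers


def mla_elements_per_token(d_c: int, d_rope: int, n_layers: int) -> int:
    """One latent vector plus one shared RoPE key per layer."""
    return (d_c + d_rope) * n_layers


def cache_gb(elements_per_token: int, context: int, batch: int, bits: int) -> float:
    """Cache size in GB (10**9 bytes) for a batch of full-length sequences."""
    return elements_per_token * context * batch * bits / 8 / 1e9


mla = mla_elements_per_token(D_C, D_ROPE, N_LAYERS)
mha = mha_elements_per_token(N_HEADS, HEAD_DIM, N_LAYERS)

print(mla)                                          # 34560
print(round(cache_gb(mla, CONTEXT, 1, KV_BITS), 1))   # 3.4
print(round(cache_gb(mla, CONTEXT, 64, KV_BITS), 1))  # 217.4
print(round(mha / mla, 1))                          # 56.9
```

The last line compares element counts against a hypothetical MHA model of the same shape. It is not the paper's 93.3% figure, which is measured against DeepSeek 67B.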

The second bottleneck is MoE communication. Sparse experts reduce activated parameters per token, but expert parallelism scatters routed experts across devices. If selected experts span many devices, all-to-all traffic becomes expensive. Fine-grained expert segmentation raises specialization capacity but increases possible routed experts and communication pressure.

The third bottleneck is training cost and load balance. Sparse models reduce per-token FLOPs in practice only if experts are used evenly and communication is balanced across devices. The paper warns that unbalanced routing can cause routing collapse, undertrained experts, and reduced computational efficiency. On H800 clusters, this is an accelerator-utilization problem.
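One way to make "evenly used" measurable is an expert-level auxiliary balance term of the form alpha * sum_i f_i * P_i, where f_i is the normalized fraction of tokens routed to expert i and P_i is the mean affinity assigned to it. The NumPy sketch below follows that form; the toy score matrices, the default alpha, and the function name are illustrative assumptions, not the paper's settings.

```python
import numpy as np


def expert_balance_loss(scores: np.ndarray, top_k: int, alpha: float = 1.0) -> float:
    """scores: (T, N) token-to-expert affinities, each row summing to 1."""
    n_tokens, n_experts = scores.shape
    top_idx = np.argsort(-scores, axis=1)[:, :top_k]
    selected = np.zeros((n_tokens, n_experts))
    np.put_along_axis(selected, top_idx, 1.0, axis=1)
    # f_i equals 1 for every expert when routing is perfectly uniform.
    f = selected.sum(axis=0) * n_experts / (top_k * n_tokens)
    p = scores.mean(axis=0)
    return alpha * float(np.sum(f * p))


T = N = 8
K = 2

# Balanced toy case: token t prefers experts t and t + 1 (mod N).
balanced = np.full((T, N), 0.05)
for t in range(T):
    balanced[t, t] = 0.4
    balanced[t, (t + 1) % N] = 0.3

# Collapsed toy case: every token prefers experts 0 and 1.
collapsed = np.full((T, N), 0.01)
collapsed[:, 0] = 0.5
collapsed[:, 1] = 0.44

print(expert_balance_loss(balanced, K))
print(expert_balance_loss(collapsed, K))
```

In this toy setup the balanced routing scores close to 1.0 and the collapsed routing close to 3.76, so minimizing the term pushes the router away from collapse.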

Method Adaptation

The attention adaptation is Multi-head Latent Attention (MLA). Instead of caching full keys and values for every head, MLA jointly compresses keys and values into a latent vector. The paper says MHA caches 2 * n_h * d_h * l elements per token, while MLA caches (d_c + d_h^R) * l. For DeepSeek-V2, d_c is 512 and the decoupled RoPE key dimension d_h^R is 64, so the cache is the same size as that of GQA with only 2.25 groups, while the paper reports stronger performance than MHA. MLA also uses low-rank query compression to reduce training activation memory.
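The 2.25-group equivalence follows from comparing per-layer cache sizes: GQA with g groups stores 2 * g * d_h elements per token per layer, while MLA stores d_c + d_h^R. A minimal check, assuming a per-head dimension of 128 (not stated in this card) and an illustrative function name:

```python
def gqa_equivalent_groups(d_c: int, d_rope: int, head_dim: int) -> float:
    """Number of GQA groups whose KV cache matches the MLA cache per token per layer."""
    return (d_c + d_rope) / (2 * head_dim)


# d_c = 512, d_h^R = 64, assumed d_h = 128
print(gqa_equivalent_groups(512, 64, 128))  # 2.25
```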

The feed-forward adaptation is DeepSeekMoE. Each MoE layer has 2 shared experts and 160 routed experts, with 6 routed experts active per token. Shared experts handle common knowledge while routed experts specialize. The model stores many total parameters but activates only 21B per token, making compute closer to a smaller dense model while preserving capacity.
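The layer structure described above can be sketched for a single token: shared experts always run, a softmax over token-to-expert affinities picks the top 6 routed experts, and their outputs are added to the residual weighted by those affinities. This is a toy NumPy illustration, not the model's implementation: the dimensions, random weights, ReLU experts, and all names are assumptions made for the example.

```python
import numpy as np


def ffn(x, w_in, w_out):
    """Toy expert: a two-layer ReLU MLP standing in for the real expert FFN."""
    return np.maximum(x @ w_in, 0.0) @ w_out


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def moe_forward(u, shared, routed, centroids, top_k):
    """u: (d,) hidden state of one token. Returns the output and the chosen experts."""
    scores = softmax(centroids @ u)       # affinity to each routed expert
    top = np.argsort(-scores)[:top_k]     # indices of the active routed experts
    out = u.copy()                        # residual connection
    for w_in, w_out in shared:            # shared experts run for every token
        out += ffn(u, w_in, w_out)
    for i in top:                         # routed experts are gated by affinity
        w_in, w_out = routed[i]
        out += scores[i] * ffn(u, w_in, w_out)
    return out, top


rng = np.random.default_rng(0)
d, hidden = 16, 8                         # toy sizes, far smaller than the model
n_shared, n_routed, top_k = 2, 160, 6


def make_expert():
    return rng.normal(size=(d, hidden)) * 0.1, rng.normal(size=(hidden, d)) * 0.1


shared = [make_expert() for _ in range(n_shared)]
routed = [make_expert() for _ in range(n_routed)]
centroids = rng.normal(size=(n_routed, d))
u = rng.normal(size=d)

out, top = moe_forward(u, shared, routed, centroids, top_k)
print(out.shape, sorted(top.tolist()))
print(f"experts run per token: {n_shared + top_k} of {n_shared + n_routed}")
```

Only 8 of the 162 experts in a layer execute for a given token, which is where the gap between stored and activated parameters comes from.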

The systems adaptation is as important as the architecture. Device-limited routing bounds each token's expert targets to at most 3 devices. Expert-level, device-level, and communication balance losses keep compute and traffic balanced. A token-dropping strategy accelerates training on overloaded devices, while no tokens are dropped for evaluation. The infrastructure avoids tensor parallelism because activated parameters are relatively few and some operators are recomputed to save activation memory. The authors also overlap shared-expert computation with all-to-all communication, customize CUDA kernels for communication, routing, and fused expert computation, and optimize MLA on top of an improved version of FlashAttention-2.
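Device-limited routing can be sketched as a two-stage selection: first pick the devices to contact, then run the usual top-K only among experts on those devices. The version below ranks each device by its best expert affinity and assumes experts are laid out contiguously, 20 per device. Both choices, and the function name, are assumptions for illustration rather than details confirmed by this card.

```python
import numpy as np


def device_limited_topk(scores, n_devices, max_devices, top_k):
    """scores: (N_r,) float affinities of one token to the routed experts."""
    per_device = scores.reshape(n_devices, -1)
    # Stage 1: keep the devices that host the highest-affinity experts.
    devices = np.argsort(-per_device.max(axis=1))[:max_devices]
    # Stage 2: ordinary top-K, restricted to experts on the kept devices.
    masked = np.full_like(per_device, -np.inf)
    masked[devices] = per_device[devices]
    experts = np.argsort(-masked.reshape(-1))[:top_k]
    return experts, np.sort(devices)


rng = np.random.default_rng(0)
scores = rng.random(160)
experts, devices = device_limited_topk(scores, n_devices=8, max_devices=3, top_k=6)

experts_per_device = 160 // 8
print("devices contacted:", devices.tolist())
print("experts chosen:", sorted(experts.tolist()))
print("their devices:", sorted(set((experts // experts_per_device).tolist())))
```

Because the selection never leaves the chosen devices, the all-to-all fan-out per token is capped at 3 regardless of how the 6 best experts would otherwise be spread.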

For serving, the method continues the memory-first design. Parameters are converted to FP8, and KV cache elements are quantized to 6 bits on average. Together these let the service hold larger batches: the weights take less memory, and each cached token costs fewer bytes to store and to read back during decoding.
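A rough memory budget shows how these two choices set the batch ceiling. The sketch below assumes 80 GB of HBM per H800 (not stated in this card), ignores activations and runtime buffers, and reads "128K" as 131072 tokens, so it gives an optimistic upper bound rather than a measured limit. All names are illustrative.

```python
HBM_PER_GPU_GB = 80      # assumption: 80 GB per H800
GPUS_PER_NODE = 8
WEIGHT_GB = 236          # 236B parameters at one byte each (FP8)

ELEMENTS_PER_TOKEN = (512 + 64) * 60    # MLA cache elements per token
BYTES_PER_TOKEN = ELEMENTS_PER_TOKEN * 6 / 8   # 6-bit average storage


def max_sequences(context_tokens: int) -> int:
    """Full-length sequences whose cache fits next to the weights on one node."""
    free_bytes = (HBM_PER_GPU_GB * GPUS_PER_NODE - WEIGHT_GB) * 1e9
    return int(free_bytes // (BYTES_PER_TOKEN * context_tokens))


print(BYTES_PER_TOKEN)             # 25920.0
print(max_sequences(128 * 1024))   # 118
```

Each cached token costs about 26 KB, so under these assumptions shorter contexts scale the admissible batch almost linearly.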

Evidence

The headline comparison is against DeepSeek 67B. DeepSeek-V2 has 236B total parameters but only 21B activated per token, and the paper reports stronger performance while saving 42.5% of training cost, reducing KV cache by 93.3%, and raising maximum generation throughput to 5.76 times that of DeepSeek 67B. On the H800 cluster, DeepSeek 67B requires 300.6K GPU hours per trillion tokens, while DeepSeek-V2 requires 172.8K.
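The 42.5% saving follows directly from the two GPU-hour figures. The short check below also extrapolates a total for the 8.1T-token run; that total assumes a constant per-token cost across the whole run and is an estimate made here, not a number quoted from the paper.

```python
GPU_HOURS_PER_TRILLION = {
    "DeepSeek 67B": 300.6e3,
    "DeepSeek-V2": 172.8e3,
}


def relative_saving(baseline: float, new: float) -> float:
    """Fraction of the baseline cost that is saved."""
    return 1 - new / baseline


saving = relative_saving(
    GPU_HOURS_PER_TRILLION["DeepSeek 67B"],
    GPU_HOURS_PER_TRILLION["DeepSeek-V2"],
)
print(f"{saving:.1%}")  # 42.5%

# Extrapolation under a constant-cost assumption.
estimated_total = GPU_HOURS_PER_TRILLION["DeepSeek-V2"] * 8.1
print(round(estimated_total))
```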

Inference evidence is similarly concrete. In deployed-style evaluation using the prompt and generation length distribution from the DeepSeek 67B service, a single 8-H800 node runs DeepSeek-V2 above 50K generation tokens per second and above 100K prompt input tokens per second. The paper attributes this to MLA, FP8 parameters, and KV-cache quantization, which together let the service carry larger batches.

Long-context evidence comes from the 128K extension. The paper uses YaRN on decoupled RoPE and reports Needle-in-a-Haystack testing up to 128K, where DeepSeek-V2 performs well across tested lengths. The compute point is that 128K serving is not only a positional-encoding problem; it becomes practical because MLA and low-bit cache storage reduce the memory cost of each additional token.

Capability tables show that the sparse compute structure is not merely cheaper. With 21B activated parameters, DeepSeek-V2 is reported at a top-tier level among open-source models, though the authors acknowledge a slight basic-English gap against LLaMA 3 70B because of fewer English tokens.

Historical Effect

DeepSeek-V2 is historically important as a systems-shaped LLM architecture. It treats long-context serving, sparse expert capacity, routing balance, and low-bit deployment as one compute structure. Earlier MoE results often emphasized parameter count; this paper emphasizes active parameters, KV-cache elements per token, all-to-all boundaries, GPU-hour cost per trillion tokens, and node-level throughput. Economical scaling here means reducing both training FLOPs and inference memory traffic: MLA attacks the per-token cache, DeepSeekMoE attacks activated FLOPs, and device-limited routing attacks the cluster communication pattern.

Limits

  • The reported benefits depend on a specific H800 cluster, custom framework, custom CUDA kernels, and careful overlap of computation and communication.
  • MoE routing remains implementation-sensitive; poor load balance can reduce utilization or destabilize expert learning.
  • The paper itself notes a slight basic-English capability gap relative to LLaMA 3 70B.
  • The headline inference throughput is measured on the service's prompt/generation length distribution, so workloads with different length or batching patterns may see different gains.

Links