Mixture of experts

Compute interpretation

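Conditional-compute architecture that increases total parameter count without activating all weights for every token: a router sends each token to a small subset of expert sub-networks, so per-token compute scales with the number of active experts rather than with the total.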
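
To make the gap between total and active parameters concrete, here is a minimal NumPy sketch of a top-k routed layer. It is illustrative only and is not taken from any of the reading cards below. The function names (`moe_forward`, `param_counts`), the two-layer ReLU experts, and the choice to softmax over only the selected logits are all assumptions made for this example; real systems add load balancing, capacity limits, and sharding that are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(tokens, router_w, experts_w1, experts_w2, top_k=2):
    """Hypothetical top-k mixture-of-experts layer.

    tokens:     (n, d)    input token vectors
    router_w:   (d, E)    router weights, one column per expert
    experts_w1: (E, d, h) first layer of each expert
    experts_w2: (E, h, d) second layer of each expert
    """
    logits = tokens @ router_w                               # (n, E)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]        # (n, k) chosen experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits, axis=-1)                     # (n, k), rows sum to 1

    out = np.zeros_like(tokens)
    for e in range(router_w.shape[1]):
        # Rows (tokens) and slots where expert e was selected.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # this expert does no work for this batch
        hidden = np.maximum(tokens[token_ids] @ experts_w1[e], 0.0)  # ReLU
        out[token_ids] += gates[token_ids, slot][:, None] * (hidden @ experts_w2[e])
    return out, top_idx

def param_counts(d, h, num_experts, top_k):
    # Total parameters stored vs. parameters touched for a single token.
    per_expert = 2 * d * h
    router = d * num_experts
    total = router + num_experts * per_expert
    active = router + top_k * per_expert
    return total, active

rng = np.random.default_rng(0)
d, h, E, k = 8, 16, 4, 2
tokens = rng.normal(size=(5, d))
router_w = rng.normal(size=(d, E))
w1 = rng.normal(size=(E, d, h))
w2 = rng.normal(size=(E, h, d))

out, chosen = moe_forward(tokens, router_w, w1, w2, top_k=k)
print(out.shape, chosen.shape)
print(param_counts(d, h, E, k))  # (1056, 544)
```

In this toy configuration the layer stores 1056 parameters but each token touches only 544. Adding experts raises the first number while leaving the second unchanged, apart from the router column each new expert needs.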

Supporting reading cards

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017, sparse_memory_efficient_scaling)
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (2020, sparse_memory_efficient_scaling)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (2021, sparse_memory_efficient_scaling)
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (2021, sparse_memory_efficient_scaling)
Mixtral of Experts (2024, sparse_memory_efficient_scaling)
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (2024, sparse_memory_efficient_scaling)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025, inference_time_compute_post_training)
Qwen3 Technical Report (2025, hyperscale_dense_llm_training)
Kimi K2: Open Agentic Intelligence (2025, sparse_memory_efficient_scaling)
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (2025, sparse_memory_efficient_scaling)
Kimi K2.5: Visual Agentic Intelligence (2026, inference_time_compute_post_training)
Qwen3.5-Omni Technical Report (2026, generative_media_compute)

Obsolete or less central under later compute

Track this only through linked reading cards; do not treat this method page as standalone evidence.