Mixture of experts
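English source file: moe.md
Compute interpretation
See the English source file. This Chinese mirror syncs only the reading-card links, to avoid introducing claims that are not in the English source file.
Supporting reading cards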
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017, sparse_memory_efficient_scaling)
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (2020, sparse_memory_efficient_scaling)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (2021, sparse_memory_efficient_scaling)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (2021, sparse_memory_efficient_scaling)
- Mixtral of Experts (2024, sparse_memory_efficient_scaling)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (2024, sparse_memory_efficient_scaling)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025, inference_time_compute_post_training)
- Qwen3 Technical Report (2025, hyperscale_dense_llm_training)
- Kimi K2: Open Agentic Intelligence (2025, sparse_memory_efficient_scaling)
- DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (2025, sparse_memory_efficient_scaling)
- Kimi K2.5: Visual Agentic Intelligence (2026, inference_time_compute_post_training)
- Qwen3.5-Omni Technical Report (2026, generative_media_compute)
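Content that is outdated or has become secondary under later compute paradigms
Tracked only through the linked reading cards; this method page is not treated as an independent source of evidence.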