| 63 | 2017 | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | 5 | downloaded / read_complete |
| 64 | 2020 | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding | 5 | downloaded / read_complete |
| 65 | 2021 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | 5 | downloaded / read_complete |
| 66 | 2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | 5 | downloaded / read_complete |
| 67 | 2021 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | 4 | downloaded / read_complete |
| 68 | 2024 | Mixtral of Experts | 4 | downloaded / read_complete |
| 69 | 2020 | Big Bird: Transformers for Longer Sequences | 3 | downloaded / read_complete |
| 70 | 2020 | Linformer: Self-Attention with Linear Complexity | 3 | downloaded / read_complete |
| 71 | 2020 | Longformer: The Long-Document Transformer | 3 | downloaded / read_complete |
| 72 | 2020 | Reformer: The Efficient Transformer | 3 | downloaded / read_complete |
| 73 | 2023 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | 3 | downloaded / read_complete |
| 74 | 2024 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | 3 | downloaded / read_complete |
| 129 | 2025 | Kimi K2: Open Agentic Intelligence | 4 | downloaded / read_complete |
| 130 | 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models | 4 | downloaded / read_complete |