Papers

132 papers

1986 Learning representations by back-propagating errors Pre-2012 CPU and statistical foundations 30757

1995 Support-vector networks Pre-2012 CPU and statistical foundations 40427

1998 Gradient-based learning applied to document recognition Pre-2012 CPU and statistical foundations 58089

1998 The MNIST database of handwritten digit images for machine learning research Pre-2012 CPU and statistical foundations 4604

2006 Reducing the dimensionality of data with neural networks Pre-2012 CPU and statistical foundations 20914

2006 A fast learning algorithm for deep belief nets Pre-2012 CPU and statistical foundations 16386

2009 ImageNet: A large-scale hierarchical image database Pre-2012 CPU and statistical foundations 61711

2010 Large-scale machine learning with stochastic gradient descent Pre-2012 CPU and statistical foundations 5624

2012 ImageNet Classification with Deep Convolutional Neural Networks Single-GPU deep learning 766 ★

2013 Auto-Encoding Variational Bayes Generative media compute —

2014 Neural Machine Translation by Jointly Learning to Align and Translate Single-GPU deep learning 14620

2014 Sequence to Sequence Learning with Neural Networks Single-GPU deep learning 13358

2014 Adam: A Method for Stochastic Optimization Single-GPU deep learning 84773

2014 Dropout: A Simple Way to Prevent Neural Networks from Overfitting Single-GPU deep learning 34275

2014 Rich feature hierarchies for accurate object detection and semantic segmentation Single-GPU deep learning 31710

2014 Very Deep Convolutional Networks for Large-Scale Image Recognition Single-GPU deep learning 75538

2014 Going Deeper with Convolutions Single-GPU deep learning 1390

2014 Generative Adversarial Nets Generative media compute —

2015 Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Single-GPU deep learning 24377

2015 Fast R-CNN Single-GPU deep learning 27853

2015 Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Single-GPU deep learning 18238

2015 U-Net: Convolutional Networks for Biomedical Image Segmentation Single-GPU deep learning 88677

2015 Deep Residual Learning for Image Recognition Multi-GPU dense training 4712

2015 Rethinking the Inception Architecture for Computer Vision Multi-GPU dense training 565

2015 Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks Generative media compute —

2015 Distilling the Knowledge in a Neural Network Efficient and edge inference —

2015 Human-level control through deep reinforcement learning Search, simulation, and science compute —

2016 Identity Mappings in Deep Residual Networks Multi-GPU dense training 10082

2016 Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Multi-GPU dense training 5668

2016 Layer Normalization Multi-GPU dense training 498

2016 Xception: Deep Learning with Depthwise Separable Convolutions Multi-GPU dense training 358

2016 Densely Connected Convolutional Networks Multi-GPU dense training 1912

2016 Image-to-Image Translation with Conditional Adversarial Networks Generative media compute —

2016 SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size Efficient and edge inference —

2016 Mastering the game of Go with deep neural networks and tree search Search, simulation, and science compute —

2017 Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour Multi-GPU dense training 2619

2017 Mixed Precision Training Multi-GPU dense training 880

2017 Attention Is All You Need TPU and accelerator Transformer era 331 ★

2017 In-Datacenter Performance Analysis of a Tensor Processing Unit TPU and accelerator Transformer era 4406

2017 Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer Sparse and memory-efficient scaling —

2017 Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks Generative media compute —

2017 Deep Reinforcement Learning from Human Preferences Inference-time compute and post-training —

2017 MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications Efficient and edge inference —

2017 Mastering the game of Go without human knowledge Search, simulation, and science compute —

2018 Group Normalization Multi-GPU dense training 478

2018 Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour with Batch Normalization Multi-GPU dense training 53

2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding TPU and accelerator Transformer era 1631

2018 GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism TPU and accelerator Transformer era —

2018 Mesh-TensorFlow: Deep Learning for Supercomputers TPU and accelerator Transformer era —

2018 A Style-Based Generator Architecture for Generative Adversarial Networks Generative media compute —

2018 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play Search, simulation, and science compute —

2019 EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks Multi-GPU dense training 5013

2019 Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer TPU and accelerator Transformer era 3692

2019 ALBERT: A Lite BERT for Self-supervised Learning of Language Representations TPU and accelerator Transformer era 984

2019 Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context TPU and accelerator Transformer era 3146

2019 RoBERTa: A Robustly Optimized BERT Pretraining Approach TPU and accelerator Transformer era 8

2019 XLNet: Generalized Autoregressive Pretraining for Language Understanding TPU and accelerator Transformer era 1857

2019 What Does BERT Look at? An Analysis of BERT's Attention TPU and accelerator Transformer era —

2019 Language Models are Unsupervised Multitask Learners Hyperscale dense LLM training —

2019 Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism Hyperscale dense LLM training —

2019 ZeRO: Memory Optimizations Toward Training Trillion Parameter Models Hyperscale dense LLM training —

2019 DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter Efficient and edge inference —

2019 Mastering Atari, Go, chess and shogi by planning with a learned model Search, simulation, and science compute —

2019 Grandmaster level in StarCraft II using multi-agent reinforcement learning Search, simulation, and science compute —

2019 Dota 2 with Large Scale Deep Reinforcement Learning Search, simulation, and science compute —

2020 An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale TPU and accelerator Transformer era —

2020 Language Models are Few-Shot Learners Hyperscale dense LLM training —

2020 Scaling Laws for Neural Language Models Hyperscale dense LLM training —

2020 GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding Sparse and memory-efficient scaling —

2020 Big Bird: Transformers for Longer Sequences Sparse and memory-efficient scaling —

2020 Linformer: Self-Attention with Linear Complexity Sparse and memory-efficient scaling —

2020 Longformer: The Long-Document Transformer Sparse and memory-efficient scaling —

2020 Reformer: The Efficient Transformer Sparse and memory-efficient scaling —

2020 Denoising Diffusion Probabilistic Models Generative media compute —

2020 Score-Based Generative Modeling through Stochastic Differential Equations Generative media compute —

2020 Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Inference-time compute and post-training —

2020 Learning to summarize from human feedback Inference-time compute and post-training —

2020 REALM: Retrieval-Augmented Language Model Pre-Training Inference-time compute and post-training —

2020 Improved protein structure prediction using potentials from deep learning Search, simulation, and science compute —

2021 Scaling Language Models: Methods, Analysis and Insights from Training Gopher Hyperscale dense LLM training —

2021 Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity Sparse and memory-efficient scaling —

2021 GLaM: Efficient Scaling of Language Models with Mixture-of-Experts Sparse and memory-efficient scaling —

2021 High-Resolution Image Synthesis with Latent Diffusion Models Generative media compute —

2021 Zero-Shot Text-to-Image Generation Generative media compute —

2021 Improved Denoising Diffusion Probabilistic Models Generative media compute —

2021 WebGPT: Browser-assisted question-answering with human feedback Inference-time compute and post-training —

2021 LoRA: Low-Rank Adaptation of Large Language Models Efficient and edge inference —

2021 Highly accurate protein structure prediction with AlphaFold Search, simulation, and science compute —

2022 Training Compute-Optimal Large Language Models Hyperscale dense LLM training —

2022 PaLM: Scaling Language Modeling with Pathways Hyperscale dense LLM training —

2022 BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Hyperscale dense LLM training —

2022 GPT-NeoX-20B: An Open-Source Autoregressive Language Model Hyperscale dense LLM training —

2022 OPT: Open Pre-trained Transformer Language Models Hyperscale dense LLM training —

2022 FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness Sparse and memory-efficient scaling —

2022 Scalable Diffusion Models with Transformers Generative media compute —

2022 Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Inference-time compute and post-training —

2022 Training language models to follow instructions with human feedback Inference-time compute and post-training —

2022 ReAct: Synergizing Reasoning and Acting in Language Models Inference-time compute and post-training —

2022 Self-Consistency Improves Chain of Thought Reasoning in Language Models Inference-time compute and post-training —

2022 Constitutional AI: Harmlessness from AI Feedback Inference-time compute and post-training —

2022 Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks Inference-time compute and post-training —

2022 GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers Efficient and edge inference —

2023 Llama 2: Open Foundation and Fine-Tuned Chat Models Hyperscale dense LLM training —

2023 LLaMA: Open and Efficient Foundation Language Models Hyperscale dense LLM training —

2023 Gemini: A Family of Highly Capable Multimodal Models Hyperscale dense LLM training —

2023 Mistral 7B Hyperscale dense LLM training —

2023 Textbooks Are All You Need Hyperscale dense LLM training —

2023 A Survey of Large Language Models Hyperscale dense LLM training —

2023 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning Sparse and memory-efficient scaling —

2023 Toolformer: Language Models Can Teach Themselves to Use Tools Inference-time compute and post-training —

2023 Let's Verify Step by Step Inference-time compute and post-training —

2023 Direct Preference Optimization: Your Language Model is Secretly a Reward Model Inference-time compute and post-training —

2023 Tree of Thoughts: Deliberate Problem Solving with Large Language Models Inference-time compute and post-training —

2023 Voyager: An Open-Ended Embodied Agent with Large Language Models Inference-time compute and post-training —

2023 QLoRA: Efficient Finetuning of Quantized LLMs Efficient and edge inference —

2023 Fast Inference from Transformers via Speculative Decoding Efficient and edge inference —

2024 The Llama 3 Herd of Models Hyperscale dense LLM training —

2024 Mixtral of Experts Sparse and memory-efficient scaling —

2024 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model Sparse and memory-efficient scaling —

2024 Accurate structure prediction of biomolecular interactions with AlphaFold 3 Search, simulation, and science compute —

2025 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Inference-time compute and post-training —

2025 Kimi k1.5: Scaling Reinforcement Learning with LLMs Inference-time compute and post-training —

2025 s1: Simple test-time scaling Inference-time compute and post-training —

2025 Qwen3 Technical Report Hyperscale dense LLM training —

2025 Gemma 3 Technical Report Efficient and edge inference —

2025 BitNet b1.58 2B4T Technical Report Efficient and edge inference —

2025 Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2 Search, simulation, and science compute —

2025 AlphaEvolve: A coding agent for scientific and algorithmic discovery Search, simulation, and science compute —

2025 Kimi K2: Open Agentic Intelligence Sparse and memory-efficient scaling —

2025 DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models Sparse and memory-efficient scaling —

2026 Kimi K2.5: Visual Agentic Intelligence Inference-time compute and post-training —

2026 Qwen3.5-Omni Technical Report Generative media compute —