Single-GPU deep learning

Commodity GPUs make high-throughput training on dense tensors practical. CNNs, dropout, and batch normalization become dominant.

12 papers, regime 2 of 10

Device/setup

One or a few workstation GPUs, most visibly Fermi- and Kepler-era NVIDIA cards with limited memory, fed by CPU-side data pipelines.
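The CPU-side pipeline mentioned above can be sketched as a background thread that prepares minibatches while the consumer (the GPU training step in a real setup) works on the previous one. This is a minimal illustration using only the Python standard library. The names `make_batches`, `load_and_preprocess`, and `prefetching_loader` are hypothetical, and the preprocessing is a stand-in for real image decoding and augmentation.

```python
import queue
import threading


def make_batches(num_samples, batch_size):
    """Yield lists of sample indices, one list per minibatch."""
    for start in range(0, num_samples, batch_size):
        yield list(range(start, min(start + batch_size, num_samples)))


def load_and_preprocess(indices):
    """Hypothetical stand-in for CPU-side decoding and augmentation."""
    return [float(i) / 255.0 for i in indices]


def prefetching_loader(num_samples, batch_size, depth=2):
    """Run preprocessing on a background thread and yield ready batches.

    The bounded queue holds at most `depth` batches, so the CPU worker stays
    slightly ahead of the consumer without unbounded memory growth.
    """
    buffer = queue.Queue(maxsize=depth)
    sentinel = object()  # marks the end of the epoch

    def worker():
        for indices in make_batches(num_samples, batch_size):
            buffer.put(load_and_preprocess(indices))
        buffer.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is sentinel:
            return
        yield batch


batches = list(prefetching_loader(num_samples=10, batch_size=4))
```

The bounded queue is the design choice that matters here: it lets data preparation overlap with compute while capping how many prepared batches sit in host memory.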

Bottleneck

Fitting deeper networks into GPU memory while keeping convolution, recurrent training, normalization, and detection pipelines numerically stable.
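To make the memory side of this bottleneck concrete, the following back-of-the-envelope sketch sums the activation storage of a convolutional stack. The layer shapes are a hypothetical VGG-like configuration and the 3 GiB budget is an assumption chosen for illustration, not a figure from the text. The estimate counts only 32-bit forward activations and ignores weights, gradients, optimizer state, and kernel workspace, so real limits would be tighter.

```python
def conv_output_elements(batch, channels, height, width):
    """Number of values in one layer's output activation tensor."""
    return batch * channels * height * width


def activation_bytes(batch, layers, bytes_per_value=4):
    """Sum activation storage for a list of (channels, height, width) outputs.

    Training keeps each layer's output for the backward pass, so the
    activations add up across layers rather than being reused.
    """
    total = sum(conv_output_elements(batch, c, h, w) for c, h, w in layers)
    return total * bytes_per_value


# Hypothetical VGG-like stack: two convs per stage, pooling halves H and W.
LAYERS = [
    (64, 224, 224), (64, 224, 224),
    (128, 112, 112), (128, 112, 112),
    (256, 56, 56), (256, 56, 56),
    (512, 28, 28), (512, 28, 28),
]

GIB = 1024 ** 3
ASSUMED_GPU_BYTES = 3 * GIB  # assumption for illustration, not from the text


def largest_batch_that_fits(layers, budget_bytes):
    """Largest batch size whose activations alone stay within the budget."""
    per_sample = activation_bytes(1, layers)
    return budget_bytes // per_sample


per_sample_mib = activation_bytes(1, LAYERS) / 1024 ** 2
max_batch = largest_batch_that_fits(LAYERS, ASSUMED_GPU_BYTES)
```

Under these assumptions the early, high-resolution layers dominate the total, which is why batch size and input resolution were the first knobs to turn when a network did not fit.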

Methods that fit

Convolutions, dropout, Adam, batch normalization, encoder-decoder attention, region-based detectors, U-Net, VGG/Inception-style depth, and GPU-friendly minibatching matched the single-GPU regime.
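Two of the methods listed above are small enough to sketch directly. The block below shows training-mode batch normalization of one feature across a minibatch and a single-parameter Adam update, written in plain Python for readability instead of as GPU kernels. The function names and the toy objective are illustrative assumptions. The default hyperparameters follow the values commonly quoted for Adam, and inference-time batch normalization (which uses running statistics) is not covered.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a minibatch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased minibatch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v).

    `t` is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])

# Toy objective (an assumption): minimize f(x) = (x - 3)^2, gradient 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * (x - 3), m, v, t, lr=0.05)
```

Both operations reduce over a minibatch or keep only per-parameter state, which is part of why they map well onto a single device holding the whole model.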

Methods that became obsolete or less central

CPU-only training and hand-crafted vision/NLP pipelines lost ground where dense GPU kernels could learn features directly.

Representative papers

| Rank | Year | Paper | Priority | Status |
| --- | --- | --- | --- | --- |
| 9 | 2012 | ImageNet Classification with Deep Convolutional Neural Networks | 10 | downloaded / read_complete |
| 10 | 2014 | Neural Machine Translation by Jointly Learning to Align and Translate | 8 | downloaded / read_complete |
| 11 | 2014 | Sequence to Sequence Learning with Neural Networks | 8 | downloaded / read_complete |
| 12 | 2015 | Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift | 8 | downloaded / read_complete |
| 13 | 2014 | Adam: A Method for Stochastic Optimization | 7 | downloaded / read_complete |
| 14 | 2014 | Dropout: A Simple Way to Prevent Neural Networks from Overfitting | 7 | downloaded / read_complete |
| 15 | 2014 | Rich feature hierarchies for accurate object detection and semantic segmentation | 7 | downloaded / read_complete |
| 16 | 2014 | Very Deep Convolutional Networks for Large-Scale Image Recognition | 7 | downloaded / read_complete |
| 17 | 2015 | Fast R-CNN | 7 | downloaded / read_complete |
| 18 | 2015 | Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks | 7 | downloaded / read_complete |
| 19 | 2015 | U-Net: Convolutional Networks for Biomedical Image Segmentation | 6 | downloaded / read_complete |
| 20 | 2014 | Going Deeper with Convolutions | 3 | downloaded / read_complete |

Open questions

  • How much of the gain came from algorithms, how much from CUDA kernels and memory layout, and how much from larger labeled datasets?

Papers in this compute regime: 12