Large-scale machine learning with stochastic gradient descent
Large-Scale Machine Learning with Stochastic Gradient Descent
Metadata
- Slug:
sgd_large_scale_2010 - Year: 2010
- Venue: COMPSTAT
- Author: Leon Bottou
- Reading status: read complete
- Compute regime: Pre-2012 CPU and statistical foundations
- Primary sources: PDF, extracted text
Compute Setup
The paper does not name a specific processor, server, or accelerator. By the project rule, the device context is inferred from the 2009 research period: CPU-based large-scale learning on commodity research machines or clusters, before GPU deep learning became the default training path. This is an inference about the era, not a hardware claim made by the paper.
The paper explicitly frames the compute problem: during the preceding decade, data sizes had grown faster than processor speed, so statistical learning methods were limited by computing time rather than by sample availability. The benchmark examples are also CPU-era in structure: sparse text classification, linear SVM/logistic objectives, and CRF training with millions of sparse parameters rather than dense accelerator kernels.
The central unit of work is an example-level update. Full gradient descent computes the gradient over the whole dataset before each parameter update. SGD is the "drastic simplification": estimate that gradient from one randomly selected example and update immediately. This changes the memory and bandwidth profile from repeated full-dataset scans to streaming access with cheap, noisy updates.
Bottleneck
The bottleneck is wall-clock time under abundant data. Exact empirical-risk minimization on the full dataset can spend too much compute reducing optimization error while processing too few examples. Bottou decomposes expected risk into approximation, estimation, and optimization errors, then places those under constraints on maximum computation time and maximum training-set size.
That decomposition is the compute-structure contribution. In small-scale learning, where examples are scarce and compute is comparatively cheap, it makes sense to drive optimization error close to zero. In large-scale learning, where the limit is Tmax, approximate optimization can yield lower expected risk because it processes more distinct examples in the allowed time. The optimizer is therefore not judged by how accurately it solves the empirical objective alone; it is judged by the risk reached under a time budget.
Sparse data amplifies this argument. RCV1 has hundreds of thousands of documents and 47,152 sparse TF-IDF features. Each stochastic update touches only active features, aligning memory traffic with the data representation. Batch or second-order solvers may lose the wall-clock race before their extra precision matters.
Method Adaptation
SGD fits this setup because it trades exact optimization for fast passes over data:
- Each update touches one example or a small subset rather than the full dataset.
- Approximate optimization can improve expected risk by processing more examples in the same time.
- Averaged SGD and second-order stochastic variants can approach asymptotic efficiency after very few passes.
- The method pairs naturally with sparse high-dimensional text features and linear models.
The paper's adaptation is intentionally device modest: it does not rely on a special accelerator or a dense matrix-multiply kernel. It makes useful progress from streaming examples, with little need to remember which examples have already been visited. Bottou notes that deployed online systems can process examples on the fly, which is a bandwidth and memory advantage as well as a statistical one.
ASGD and second-order stochastic variants are presented as ways to recover some of the asymptotic efficiency of heavier methods without returning to full-batch computation. The compromise is that the stochastic methods accept gradient noise and tune gains or averaging to keep variance under control. This is the same compute bargain later deep learning inherits: spend most of the device budget on many noisy but useful updates, not on exact optimization of a finite training set.
Evidence
- On RCV1 CCAT classification, the training set contains 781,265 documents represented by 47,152 sparse TF-IDF features.
- For hinge-loss SVM on RCV1, the table reports SVMLight at 23,642 seconds with 6.02% test error, SVMPerf at 66 seconds with 6.03%, and SGD at 1.4 seconds with 6.02%.
- For log-loss SVM on the same task, TRON takes 30 or 44 seconds depending on tolerance, while SGD takes 2.3 seconds and reaches 5.66% test error.
- The paper states that expected risk stops improving long before the superlinear TRON optimizer overcomes SGD on the plotted log-loss task.
- On the 2008 Pascal Large Scale Learning Challenge ALPHA task, the training set contains 100,000 patterns with 500 variables, and ASGD nearly reaches the optimal expected risk after a single pass.
- For CONLL 2000 chunking, the CRF has 8,936 training sentences and a
1.68 * 10^6dimensional parameter space. The stochastic methods reach best test performance in a couple minutes; L-BFGS takes 72 minutes for an equivalent solution.
Historical Effect
This paper explains why stochastic first-order optimization became the default training engine for later deep learning. It predates the GPU ImageNet turn, but the lesson transfers directly: when data and models grow, the right optimizer is often the one that spends the device budget on useful noisy updates rather than exact batch optimization.
In the compute spine, Bottou supplies the statistical-compute argument that later hardware makes more dramatic. AlexNet changes the device to GPUs and the primitive to dense convolution, but it keeps minibatch SGD. Transformers change the model class and token scale, but still allocate FLOPs through stochastic updates.
Limits
- The examples are mostly CPU-era convex or structured models, not large GPU-trained neural networks.
- The paper does not address accelerator utilization, mixed precision, distributed synchronization, or model parallelism.
- Later deep learning keeps the SGD principle but changes the device and batching regime.
- The paper's timing numbers should be read as algorithmic evidence within its CPU-era benchmark context, not as a universal ranking on later hardware.
Links
- Parent regime: compute spine
- Later linked card: AlexNet 2012
- Method index: sgd
- Ledger updates: compute bottlenecks