Sources — Compute-Structure AI History

Large-Scale Machine Learning
with Stochastic Gradient Descent

Léon Bottou

NEC Labs America, Princeton NJ 08542, USA
leon@bottou.org

Abstract. During the last decade, the data sizes have grown faster than the speed
of processors. In this context, the capabilities of statistical machine learning meth-
ods is limited by the computing time rather than the sample size. A more pre-
cise analysis uncovers qualitatively different tradeoffs for the case of small-scale
and large-scale learning problems. The large-scale case involves the computational
complexity of the underlying optimization algorithm in non-trivial ways. Unlikely
optimization algorithms such as stochastic gradient descent show amazing perfor-
mance for large-scale problems. In particular, second order stochastic gradient and
averaged stochastic gradient are asymptotically efficient after a single pass on the
training set.

Keywords: Stochastic gradient descent, Online learning, Efficiency

1 Introduction

The computational complexity of learning algorithm becomes the critical
limiting factor when one envisions very large datasets. This contribution ad-
vocates stochastic gradient algorithms for large scale machine learning prob-
lems. The first section describes the stochastic gradient algorithm. The sec-
ond section presents an analysis that explains why stochastic gradient algo-
rithms are attractive when the data is abundant. The third section discusses
the asymptotical efficiency of estimates obtained after a single pass over the
training set. The last section presents empirical evidence.

2 Learning with gradient descent

Let us first consider a simple supervised learning setup. Each example z
is a pair (x, y) composed of an arbitrary input x and a scalar output y. We
consider a loss function `(ŷ, y) that measures the cost of predicting ŷ when the
actual answer is y, and we choose a family F of functions fw (x) parametrized
by a weight vector w. We seek the function f ∈ F that minimizes the loss
Q(z, w) = `(fw (x), y) averaged on the examples. Although we would like
to average over the unknown distribution dP (z) that embodies the Laws of
2 Léon Bottou

Nature, we must often settle for computing the average on a sample z1 . . . zn .
Z n
1X
E(f ) = `(f (x), y) dP (z) En (f ) = `(f (xi ), yi ) (1)
n i=1

The empirical risk En (f ) measures the training set performance. The expected
risk E(f ) measures the generalization performance, that is, the expected
performance on future examples. The statistical learning theory (Vapnik and
Chervonenkis, 1971) justifies minimizing the empirical risk instead of the
expected risk when the chosen family F is sufficiently restrictive.

2.1 Gradient descent
It has often been proposed (e.g., Rumelhart et al., 1986) to minimize the
empirical risk En (fw ) using gradient descent (GD). Each iteration updates
the weights w on the basis of the gradient of En (fw ) ,
n
1X
wt+1 = wt − γ ∇w Q(zi , wt ) , (2)
n i=1

where γ is an adequately chosen gain. Under sufficient regularity assumptions,
when the initial estimate w0 is close enough to the optimum, and when the
gain γ is sufficiently small, this algorithm achieves linear convergence (Dennis
and Schnabel, 1983), that is, − log ρ ∼ t, where ρ represents the residual error.
Much better optimization algorithms can be designed by replacing the
scalar gain γ by a positive definite matrix Γt that approaches the inverse of
the Hessian of the cost at the optimum :
n
1X
wt+1 = wt − Γt ∇w Q(zi , wt ) . (3)
n i=1

This second order gradient descent (2GD) is a variant of the well known
Newton algorithm. Under sufficiently optimistic regularity assumptions, and
provided that w0 is sufficiently close to the optimum, second order gradient
descent achieves quadratic convergence. When the cost is quadratic and the
scaling matrix Γ is exact, the algorithm reaches the optimum after a single
iteration. Otherwise, assuming sufficient smoothness, we have − log log ρ ∼ t.

2.2 Stochastic gradient descent
The stochastic gradient descent (SGD) algorithm is a drastic simplification.
Instead of computing the gradient of En (fw ) exactly, each iteration estimates
this gradient on the basis of a single randomly picked example zt :

wt+1 = wt − γt ∇w Q(zt , wt ) . (4)
Large-Scale Machine Learning 3

The stochastic process { wt , t = 1, . . . } depends on the examples randomly
picked at each iteration. It is hoped that (4) behaves like its expectation (2)
despite the noise introduced by this simplified procedure.
Since the stochastic algorithm does not need to remember which examples
were visited during the previous iterations, it can process examples on the
fly in a deployed system. In such a situation, the stochastic gradient descent
directly optimizes the expected risk, since the examples are randomly drawn
from the ground truth distribution.
The convergence of stochastic gradient descent has been studied exten-
sively in the stochastic approximation literature. P Convergence results
P usually
require decreasing gains satisfying the conditions t γt2 < ∞ and t γt = ∞.
The Robbins-Siegmund theorem (Robbins and Siegmund, 1971) provides the
means to establish almost sure convergence under mild conditions (Bottou,
1998), including cases where the loss function is not everywhere differentiable.
The convergence speed of stochastic gradient descent is in fact limited by
the noisy approximation of the true gradient. When the gains decrease too
slowly, the variance of the parameter estimate wt decreases equally slowly.
When the gains decrease too quickly, the expectation of the parameter es-
timate wt takes a very long time to approach the optimum. Under suffi-
cient regularity conditions (e.g. Murata, 1998), the best convergence speed
is achieved using gains γt ∼ t−1 . The expectation of the residual error then
decreases with similar speed, that is, E ρ ∼ t−1 .
The second order stochastic gradient descent (2SGD) multiplies the gradi-
ents by a positive definite matrix Γt approaching the inverse of the Hessian :

wt+1 = wt − γt Γt ∇w Q(zt , wt ) . (5)

Unfortunately, this modification does not reduce the stochastic noise and
therefore does not significantly improve the variance of wt . Although con-
stants are improved, the expectation of the residual error still decreases like
t−1 , that is, E ρ ∼ t−1 , (e.g. Bordes et al., 2009, appendix).

2.3 Stochastic gradient examples

Table 1 illustrates stochastic gradient descent algorithms for a number of
classic machine learning schemes. The stochastic gradient descent for the
Perceptron, for the Adaline, and for k-Means match the algorithms proposed
in the original papers. The SVM and the Lasso were first described with
traditional optimization techniques. Both Qsvm and Qlasso include a regular-
ization term controlled by the hyperparameter λ. The K-means algorithm
converges to a local minimum because Qkmeans is nonconvex. On the other
hand, the proposed update rule uses second order gains that ensure a fast
convergence. The proposed Lasso algorithm represents each weight as the
difference of two positive variables. Applying the stochastic gradient rule to
these variables and enforcing their positivity leads to sparser solutions.
4 Léon Bottou

Table 1. Stochastic gradient algorithms for various learning systems.

Loss Stochastic gradient algorithm
Adaline (Widrow and Hoff, 1960)
w ← w + γt yt − w> Φ(xt ) Φ(xt )
2
Qadaline = 12 y − w> Φ(x)
d
Φ(x) ∈ R , y = ±1
Perceptron (Rosenblatt, 1957)
yt Φ(xt ) if yt w> Φ(xt ) ≤ 0

Qperceptron = max{0, −y w> Φ(x)} w ← w + γt
0 otherwise
Φ(x) ∈ Rd , y = ±1
K-Means (MacQueen, 1967)
k∗ = arg mink (zt − wk )2
Qkmeans = min 12 (z − wk )2 nk∗ ← nk∗ + 1
k
z ∈ Rd , w1 . . . wk ∈ Rd wk∗ ← wk∗ + n1∗ (zt − wk∗ )
k
n1 . . . nk ∈ N, initially 0
SVM (Cortes and Vapnik, 1995)
if yt w> Φ(xt ) > 1,

λw
2
Qsvm = λw + max{0, 1 − y w Φ(x)} > w ← w − γ t
λw − yt Φ(xt ) otherwise.
Φ(x) ∈ Rd , y = ±1, λ > 0
Lasso (Tibshirani, 1996) ui ← ui − γt λ − (yt − w> Φ(xt ))Φi (xt ) +

1 >
2
Qlasso = λ|w|1 + 2 y − w Φ(x) vi ← vi − γt λ + (yt − w>

t Φ(xt ))Φi (xt ) +
w = (u1 − v1 , . . . , ud − vd ) with notation [x]+ = max{0, x}.
d
Φ(x) ∈ R , y ∈ R, λ > 0

3 Learning with large training sets

Let f ∗ = arg minf E(f ) be the best possible prediction function. Since we
seek the prediction function from a parametrized family of functions F, let
fF∗ = arg minf ∈F E(f ) be the best function in this family. Since we optimize
the empirical risk instead of the expected risk, let fn = arg minf ∈F En (f )
be the empirical optimum. Since this optimization can be costly, let us stop
the algorithm when it reaches an solution f˜n that minimizes the objective
function with a predefined accuracy En (f˜n ) < En (fn ) + ρ.

3.1 The tradeoffs of large scale learning

The excess error E = E E(f˜n ) − E(f ∗ ) can be decomposed in three terms

(Bottou and Bousquet, 2008) :
E = E E(fF∗ ) − E(f ∗ ) + E E(fn ) − E(fF∗ ) + E E(f˜n ) − E(fn ) .

(6)

• The approximation error Eapp = E E(fF∗ ) − E(f ∗ ) measures how closely

functions in F can approximate the optimal solution f ∗ . The approxima-
tion error can be reduced by choosing a larger family of functions.
• The estimation error Eest = E E(fn ) − E(fF∗ ) measures the effect of

minimizing the empirical risk En (f ) instead of the expected risk E(f ).
Large-Scale Machine Learning 5

The estimation error can be reduced by choosing a smaller family of
functions or by increasing the size of the training set.
• The optimization error Eopt = E(f˜n ) − E(fn ) measures the impact of
the approximate optimization on the expected risk. The optimization
error can be reduced by running the optimizer longer. The additional
computing time depends of course on the family of function and on the
size of the training set.
Given constraints on the maximal computation time Tmax and the maximal
training set size nmax , this decomposition outlines a tradeoff involving the size
of the family of functions F, the optimization accuracy ρ, and the number of
examples n effectively processed by the optimization algorithm.

n ≤ nmax
min E = Eapp + Eest + Eopt subject to (7)
F ,ρ,n T (F, ρ, n) ≤ Tmax
Two cases should be distinguished:
• Small-scale learning problems are first constrained by the maximal num-
ber of examples. Since the computing time is not an issue, we can reduce
the optimization error Eopt to insignificant levels by choosing ρ arbitrarily
small, and we can minimize the estimation error by chosing n = nmax . We
then recover the approximation-estimation tradeoff that has been widely
studied in statistics and in learning theory.
• Large-scale learning problems are first constrained by the maximal com-
puting time. Approximate optimization can achieve better expected risk
because more training examples can be processed during the allowed time.
The specifics depend on the computational properties of the chosen op-
timization algorithm.

3.2 Asymptotic analysis
Solving (7) in the asymptotic regime amounts to ensuring that the terms of
the decomposition (6) decrease at similar rates. Since the asymptotic conver-
gence rate of the excess error (6) is the convergence rate of its slowest term,
the computational effort required to make a term decrease faster would be
wasted.
For simplicity, we assume in this section that the Vapnik-Chervonenkis
dimensions of the families of functions F are bounded by a common constant.
We also assume that the optimization algorithms satisfy all the assumptions
required to achieve the convergence rates discussed in section 2. Similar anal-
yses can be carried out for specific algorithms under weaker assumptions (e.g.
Shalev-Shwartz and Srebro, 2008).
A simple application of the uniform convergence results of (Vapnik and
Chervonenkis, 1971) gives then the upper bound
r !
log n
E = Eapp + Eest + Eopt = Eapp + O +ρ .
n
6 Léon Bottou

Table 2. Asymptotic equivalents for various optimization algorithms: gradient
descent (GD, eq. 2), second order gradient descent (2GD, eq. 3), stochastic gradient
descent (SGD, eq. 4), and second order stochastic gradient descent (2SGD, eq. 5).
Although they are the worst optimization algorithms, SGD and 2SGD achieve the
fastest convergence speed on the expected risk. They differ only by constant factors
not shown in this table, such as condition numbers and weight vector dimension.
GD 2GD SGD 2SGD
Time per iteration : n n 1 1
Iterations to accuracy ρ : log ρ1 log log ρ1 1
ρ
1
ρ
Time to accuracy ρ : n log ρ1 n log log ρ1 1
ρ
1
ρ
2
1 1 1 1
Time to excess error E : 1/α log E1 1/α log E1 log log E1 E E
E E

Unfortunately the convergence rate of this bound is too pessimistic. Faster
convergence occurs when the loss function has strong convexity properties
(Lee et al., 2006) or when the data distribution satisfies certain assumptions
(Tsybakov, 2004). The equivalence
α
log n 1
E = Eapp + Eest + Eopt ∼ Eapp + + ρ , for some α ∈ , 1 , (8)
n 2

provides a more realistic view of the asymptotic behavior of the excess er-
ror (e.g. Massart, 2000, Bousquet, 2002). Since the three component of the
excess error should decrease at the same rate, the solution of the tradeoff
problem (7) must then obey the multiple asymptotic equivalences
α
log n
E ∼ Eapp ∼ Eest ∼ Eopt ∼ ∼ ρ. (9)
n
Table 2 summarizes the asymptotic behavior of the four gradient algo-
rithm described in section 2. The first three rows list the computational cost
of each iteration, the number of iterations required to reach an optimization
accuracy ρ, and the corresponding computational cost. The last row provides
a more interesting measure for large scale machine learning purposes. Assum-
ing we operate at the optimum of the approximation-estimation-optimization
tradeoff (7), this line indicates the computational cost necessary to reach a
predefined value of the excess error, and therefore of the expected risk. This
is computed by applying the equivalences (9) to eliminate n and ρ from the
third row results.
Although the stochastic gradient algorithms, SGD and 2SGD, are clearly
the worst optimization algorithms (third row), they need less time than the
other algorithms to reach a predefined expected risk (fourth row). Therefore,
in the large scale setup, that is, when the limiting factor is the computing
time rather than the number of examples, the stochastic learning algorithms
performs asymptotically better !
Large-Scale Machine Learning 7

4 Efficient learning

Let us add an additional example zt to a training set z1 . . . zt−1 . Since the
new empirical risk Et (f ) remains close to Et−1 (f ), the empirical minimum
∗
wt+1 = arg minw Et (fw ) remains close to wt∗ = arg minw Et−1 (fw ). With
sufficient regularity assumptions, a first order calculation gives the result
∗
= wt∗ − t−1 Ψt ∇w Q(zt , wt∗ ) + O t−2 ,

wt+1 (10)

where Ψt is the inverse of the Hessian of Et (fw ) in wt∗ . The similarity be-
tween this expression and the second order stochastic gradient descent rule
(5) has deep consequences. Let wt be the sequence of weights obtained by
performing a single second order stochastic gradient pass on the randomly
shuffled training set. With adequate regularity and convexity assumptions,
we can prove (e.g. Bottou and LeCun, 2004)

lim t E(fwt ) − E(fF∗ ) = lim t E(fwt∗ ) − E(fF∗ )

= I > 0. (11)
t→∞ t→∞

Therefore, a single pass of second order stochastic gradient provides a pre-
diction function fwt that approaches the optimum fF∗ as efficiently as the
empirical optimum fwt∗ . In particular, when the loss function is the log like-
lihood, the empirical optimum is the asymptotically efficient maximum like-
lihood estimate, and the second order stochastic gradient estimate is also
asymptotically efficient.
Unfortunately, second order stochastic gradient descent is computation-
ally costly because each iteration (5) performs a computation that involves
the large dense matrix Γt . Two approaches can work around this problem.

• Computationally efficient approximations of the inverse Hessian trade
asymptotic optimality for computation speed. For instance, the SGDQN
algorithm (Bordes et al., 2009) achieves interesting speeds using a diag-
onal approximation.
• The averaged stochastic gradient descent (ASGD) algorithm (Polyak and
Juditsky, 1992) performs the normal stochastic
Pt gradient update (4) and
recursively computes the average w̄t = 1t i=1 wt :

t 1
wt+1 = wt − γt ∇w Q(zt , wt ) , w̄t+1 = w̄t + wt+1 . (12)
t+1 t+1

When the gains γt decrease slower than t−1 , the w̄t converges with the
optimal asymptotic speed (11). Reaching this asymptotic regime can take
a very long time in practice. A smart selection of the gains γt helps
achieving the promised performance (Xu, 2010).
8 Léon Bottou

0.25 Expected risk

Algorithm Time Test Error
0.20
Hinge loss SVM, λ = 10−4 .
SVMLight 23,642 s. 6.02 % Training time (secs)
SVMPerf 66 s. 6.03 % 100
SGD 1.4 s. 6.02 % SGD

Log loss SVM, λ = 10−5 . 50
TRON (-e0.01) 30 s. 5.68 % TRON
TRON (-e0.001) 44 s. 5.70 %
SGD 2.3 s. 5.66 % 0.1 0.01 0.001 0.0001 1e−05 1e−06 1e−07 1e−08 1e−09
Optimization accuracy (trainingCost−optimalTrainingCost)

Fig. 1. Results achieved with a linear SVM on the RCV1 task. The lower half of
the plot shows the time required by SGD and TRON to reach a predefined accuracy
ρ on the log loss task. The upper half shows that the expected risk stops improving
long before the superlinear TRON algorithm overcomes SGD.

0.40 27.0
SGD SGD
SGDQN SGDQN
ASGD 26.0 ASGD
0.38

25.0

Expected risk Test Error (%)
0.36
24.0
0.34
23.0

0.32
22.0

0.30 21.0
0 1 2 3 4 5 0 1 2 3 4 5
Number of epochs Number of epochs

Fig. 2. Comparaison of the test set performance of SGD, SGDQN, and ASGD for
a linear squared hinge SVM trained on the ALPHA task of the 2008 Pascal Large
Scale Learning Challenge. ASGD nearly reaches the optimal expected risk after a
single pass.

Test loss Test FB1 score
5400 94
SGD
5300 SGDQN 93.8
ASGD
5200 93.6

5100 93.4

5000 93.2

4900 93

4800 92.8

4700 92.6

4600 92.4
SGD
92.2 SGDQN
4500 ASGD
4400 92
0 5 10 15 0 5 10 15
epochs epochs

Fig. 3. Comparison of the test set performance of SGD, SGDQN, and ASGD on a
CRF trained on the CONLL Chunking task. On this task, SGDQN appears more
attractive because ASGD does not reach its asymptotic performance.
Large-Scale Machine Learning 9

5 Experiments

This section briefly reports experimental results illustrating the actual per-
formance of stochastic gradient algorithms on a variety of linear systems.
We use gains γt = γ0 (1 + λγ0 t)−1 for SGD and, following (Xu, 2010),
γt = γ0 (1 + λγ0 t)−0.75 for ASGD. The initial gains γ0 were set manually
by observing the performance of each algorithm running on a subset of the
training examples.
Figure 1 reports results achieved using SGD for a linear SVM trained
for the recognition of the CCAT category in the RCV1 dataset (Lewis
et al., 2004) using both the hinge loss ( Qsvm in table 1), and the log loss,
( Qlogsvm = λw2 + log(1 + exp(−y w> Φ(x))) ). The training set contains 781,265
documents represented by 47,152 relatively sparse TF/IDF features. SGD
runs considerably faster than either the standard SVM solvers SVMLight
and SVMPerf (Joachims, 2006) or the superlinear optimization algorithm
TRON (Lin et al., 2007).
Figure 2 reports results achieved using SGD, SGDQN, and ASGD for
a linear SVM trained on the ALPHA task of the 2008 Pascal Large Scale
Learning Challenge (see Bordes et al., 2009) using the squared hinge loss
( Qsqsvm = λw2 + max{0, 1 − y w> Φ(x)}2 ). The training set contains 100,000
patterns represented by 500 centered and normalized variables. Performances
measured on a separate testing set are plotted against the number of passes
over the training set. ASGD achieves near optimal results after one pass.
Figure 3 reports results achieved using SGD, SGDQN, and ASGD for
a CRF (Lafferty et al., 2001) trained on the CONLL 2000 Chunking task
(Tjong Kim Sang and Buchholz, 2000). The training set contains 8936 sen-
tences for a 1.68 × 106 dimensional parameter space. Performances measured
on a separate testing set are plotted against the number of passes over the
training set. SGDQN appears more attractive because ASGD does not reach
its asymptotic performance. All three algorithms reach the best test set per-
formance in a couple minutes. The standard CRF L-BFGS optimizer takes
72 minutes to compute an equivalent solution.

References
BORDES. A., BOTTOU, L., and GALLINARI, P. (2009): SGD-QN: Careful Quasi-
Newton Stochastic Gradient Descent. Journal of Machine Learning Research,
10:1737-1754. With Erratum (to appear).
BOTTOU, L. and BOUSQUET, O. (2008): The Tradeoffs of Large Scale Learning,
In Advances in Neural Information Processing Systems, vol.20, 161-168.
BOTTOU, L. and LECUN, Y. (2004): On-line Learning for Very Large Datasets.
Applied Stochastic Models in Business and Industry, 21(2):137-151
BOUSQUET, O. (2002): Concentration Inequalities and Empirical Processes The-
ory Applied to the Analysis of Learning Algorithms. Thèse de doctorat, Ecole
Polytechnique, Palaiseau, France.
10 Léon Bottou

CORTES, C. and VAPNIK, V. N. (1995): Support Vector Networks, Machine
Learning, 20:273-297.
DENNIS, J. E., Jr., and SCHNABEL, R. B. (1983): Numerical Methods For Un-
constrained Optimization and Nonlinear Equations. Prentice-Hall
JOACHIMS, T. (2006): Training Linear SVMs in Linear Time. In Proceedings of
the 12th ACM SIGKDD, ACM Press.
LAFFERTY, J. D., MCCALLUM, A., and PEREIRA, F. (2001): Conditional Ran-
dom Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.
In Proceedings of ICML 2001, 282-289, Morgan Kaufman.
LEE, W. S., BARTLETT, P. L., and WILLIAMSON, R. C. (1998): The Importance
of Convexity in Learning with Squared Loss. IEEE Transactions on Informa-
tion Theory, 44(5):1974-1980.
LEWIS, D. D., YANG, Y., ROSE, T. G., and LI, F. (2004): RCV1: A New Bench-
mark Collection for Text Categorization Research. Journal of Machine Learn-
ing Research, 5:361-397.
LIN, C. J., WENG, R. C., and KEERTHI, S. S. (2007): Trust region Newton
methods for large-scale logistic regression. In Proceedings of ICML 2007, 561-
568, ACM Press.
MACQUEEN, J. (1967): Some Methods for Classification and Analysis of Multi-
variate Observations. In Fifth Berkeley Symposium on Mathematics, Statistics,
and Probabilities, vol.1, 281-297, University of California Press.
MASSART, P. (2000): Some applications of concentration inequalities to Statistics,
Annales de la Faculté des Sciences de Toulouse, series 6,9,(2):245-303.
MURATA, N. (1998): A Statistical Study of On-line Learning. In Online Learning
and Neural Networks, Cambridge University Press.
POLYAK, B. T. and JUDITSKY, A. B. (1992): Acceleration of stochastic approx-
imation by averaging. SIAM J. Control and Optimization, 30(4):838-855.
ROSENBLATT, F. (1957): The Perceptron: A perceiving and recognizing automa-
ton. Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab.
RUMELHART, D. E., HINTON, G. E., and WILLIAMS, R. J. (1986): Learning
internal representations by error propagation. In Parallel distributed processing:
Explorations in the microstructure of cognition, vol.I, 318-362, Bradford Books.
SHALEV-SHWARTZ, S. and SREBRO, N. (2008): SVM optimization: inverse de-
pendence on training set size. In Proceedings of the ICML 2008, 928-935, ACM.
TIBSHIRANI, R. (1996): Regression shrinkage and selection via the Lasso. Journal
of the Royal Statistical Society, Series B, 58(1):267-288.
TJONG KIM SANG E. F., and BUCHHOLZ, S. (2000): Introduction to the
CoNLL-2000 Shared Task: Chunking. In Proceedings of CoNLL-2000, 127-132.
TSYBAKOV, A. B. (2004): Optimal aggregation of classifiers in statistical learning,
Annals of Statististics, 32(1).
VAPNIK, V. N. and CHERVONENKIS, A. YA. (1971): On the Uniform Con-
vergence of Relative Frequencies of Events to Their Probabilities. Theory of
Probability and its Applications, 16(2):264-280.
WIDROW, B. and HOFF, M. E. (1960): Adaptive switching circuits. IRE
WESCON Conv. Record, Part 4., 96-104.
XU, W. (2010): Towards Optimal One Pass Large Scale Learning with Averaged
Stochastic Gradient Descent. Journal of Machine Learning Research (to ap-
pear).