Recitation 3: Stochastic Gradient Descent

Stochastic gradient descent (SGD) is an optimization algorithm for minimizing loss functions. SGD approximates the gradient of the expected loss using samples and takes steps proportional to the negative of this estimated gradient, which lets it handle very large datasets more efficiently than batch gradient descent. SGD converges by iteratively updating the weights with a single random sample or mini-batch at each step; the learning rate must be set appropriately for convergence, and backtracking line search can be used to select it adaptively at each step.

Stochastic Gradient Descent

10-701 Recitation 3

Mu Li

Computer Science Department


Carnegie Mellon University

February 5, 2013
The problem

- A typical machine learning problem has a penalty/regularizer + loss form:

  $$\min_w F(w) = g(w) + \frac{1}{n}\sum_{i=1}^n f(w; y_i, x_i)$$

  where $x_i, w \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and both $g$ and $f$ are convex
- Today we only consider differentiable $f$, and let $g = 0$ for simplicity
- For example, let $f(w; y_i, x_i) = -\log p(y_i \mid x_i, w)$; then we are maximizing the log likelihood (see the sketch after this list):

  $$\max_w \frac{1}{n}\sum_{i=1}^n \log p(y_i \mid x_i, w)$$
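As a concrete instance of this setup, here is a minimal sketch (my own illustration, not from the slides) of the logistic-regression loss $f$ and its gradient for labels $y \in \{-1, +1\}$; the later sketches reuse these helper names:

```python
import numpy as np

def f(w, y, x):
    """Negative log-likelihood of one sample under logistic regression:
    f(w; y, x) = log(1 + exp(-y <w, x>)), with y in {-1, +1}."""
    return np.logaddexp(0.0, -y * np.dot(w, x))

def grad_f(w, y, x):
    """Gradient of f with respect to w for a single sample."""
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

def F(w, X, Y):
    """Empirical loss: average of f over all n samples (g = 0)."""
    return np.mean([f(w, y, x) for x, y in zip(X, Y)])

def grad_F(w, X, Y):
    """Average gradient over all n samples."""
    return np.mean([grad_f(w, y, x) for x, y in zip(X, Y)], axis=0)
```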
Gradient Descent

- Choose an initial $w^{(0)}$, then repeat (see the sketch after this slide)

  $$w^{(t+1)} = w^{(t)} - \eta_t \nabla F(w^{(t)})$$

  until stop
- $\eta_t$ is the learning rate, and

  $$\nabla F(w^{(t)}) = \frac{1}{n}\sum_i \nabla_w f(w^{(t)}; y_i, x_i)$$

- How to stop? $\|w^{(t+1)} - w^{(t)}\| \le \epsilon$ or $\|\nabla F(w^{(t)})\| \le \epsilon$

[Figure: a two-dimensional example of the gradient descent iterates]
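A minimal sketch of this loop, reusing the hypothetical F/grad_F helpers and numpy import from the earlier snippet (the default eta and eps values are assumptions, not from the slides):

```python
def gradient_descent(w0, X, Y, eta=0.1, eps=1e-6, max_iter=1000):
    """Plain gradient descent: step along the negative average gradient
    until the gradient norm (one of the two stopping rules) is small."""
    w = w0
    for t in range(max_iter):
        g = grad_F(w, X, Y)
        if np.linalg.norm(g) <= eps:   # stop: ||grad F(w)|| <= eps
            break
        w = w - eta * g                # w^(t+1) = w^(t) - eta_t * grad F(w^(t))
    return w
```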
Learning rate matters

[Figure: two panels. Left: $\eta_t$ too small, still far from the optimum after 100 iterations. Right: $\eta_t$ too big, the iterates overshoot.]
Backtracking line search

Adaptively choose the learning rate (a sketch follows after this list):

- choose a parameter $0 < \beta < 1$
- start with $\eta = 1$, and repeat for $t = 0, 1, \ldots$
- while

  $$F(w^{(t)} - \eta \nabla F(w^{(t)})) > F(w^{(t)}) - \frac{\eta}{2}\|\nabla F(w^{(t)})\|^2$$

  update $\eta = \beta \eta$
- $w^{(t+1)} = w^{(t)} - \eta \nabla F(w^{(t)})$
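A sketch of one step of this procedure, again reusing the assumed F/grad_F helpers ($\beta = 0.8$ follows the next slide):

```python
def backtracking_step(w, X, Y, beta=0.8):
    """Shrink eta by beta until the sufficient-decrease condition
    F(w - eta*g) <= F(w) - (eta/2)*||g||^2 holds, then take the step."""
    g = grad_F(w, X, Y)
    eta = 1.0
    while F(w - eta * g, X, Y) > F(w, X, Y) - 0.5 * eta * np.dot(g, g):
        eta = beta * eta
    return w - eta * g
```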
Backtracking line search

A typical choice is $\beta = 0.8$; converged after 13 iterations:

[Figure: convergence of gradient descent with backtracking line search]

Stochastic Gradient Descent

- We call $\frac{1}{n}\sum_i f(w; y_i, x_i)$ the empirical loss; the thing we actually hope to minimize is the expected loss

  $$f(w) = \mathbb{E}_{y_i, x_i}\, f(w; y_i, x_i)$$

- Suppose we receive an infinite stream of samples $(y_t, x_t)$ from the distribution; one way to optimize the objective is

  $$w^{(t+1)} = w^{(t)} - \eta_t \nabla_w f(w^{(t)}; y_t, x_t)$$

- In practice, we simulate the stream by randomly picking $(y_t, x_t)$ from the samples we have (see the sketch after this list)
- Compare with the average gradient of GD, $\frac{1}{n}\sum_i \nabla_w f(w^{(t)}; y_i, x_i)$: SGD replaces it with a single sample's gradient, an unbiased estimate
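A minimal SGD sketch that simulates the stream by sampling one $(y_t, x_t)$ per step; the decaying schedule $\eta_t = \eta_0 / (1 + t)$ is an assumption for illustration, not from the slides:

```python
def sgd(w0, X, Y, eta0=0.5, n_steps=10000, seed=0):
    """SGD: each step uses the gradient of a single random sample,
    an unbiased estimate of the full average gradient."""
    rng = np.random.default_rng(seed)
    w = w0
    for t in range(n_steps):
        i = rng.integers(len(Y))        # simulate the stream: pick a random sample
        eta = eta0 / (1.0 + t)          # decaying learning rate (assumed schedule)
        w = w - eta * grad_f(w, Y[i], X[i])
    return w
```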
More about SGD

- The objective does not always decrease at each step
- Compared to GD, SGD needs more steps, but each step is cheaper
- Mini-batches (say, pick 100 samples and average their gradients) may accelerate convergence; a sketch follows after this list
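A mini-batch variant averaging the gradient over 100 random samples per step, as the slide suggests (the batch size default and learning-rate schedule are assumptions):

```python
def minibatch_sgd(w0, X, Y, batch=100, eta0=0.5, n_steps=1000, seed=0):
    """Mini-batch SGD: averaging over a small batch reduces the variance
    of the gradient estimate, which may accelerate convergence."""
    rng = np.random.default_rng(seed)
    w = w0
    for t in range(n_steps):
        idx = rng.integers(len(Y), size=batch)       # draw a random mini-batch
        g = np.mean([grad_f(w, Y[i], X[i]) for i in idx], axis=0)
        w = w - (eta0 / (1.0 + t)) * g
    return w
```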
Relation to Perceptron

- Recall Perceptron: initialize $w$, then repeat

  $$w \leftarrow w + \begin{cases} y_i x_i & \text{if } y_i \langle w, x_i \rangle < 0 \\ 0 & \text{otherwise} \end{cases}$$

- Fix the learning rate $\eta = 1$ and let $f(w; y_i, x_i) = \max(0, -y_i \langle w, x_i \rangle)$; then

  $$\nabla_w f(w; y_i, x_i) = \begin{cases} -y_i x_i & \text{if } y_i \langle w, x_i \rangle < 0 \\ 0 & \text{otherwise} \end{cases}$$

  so the SGD update $w \leftarrow w - \eta \nabla_w f$ is exactly the update above: we derive Perceptron from SGD (see the sketch after this list)
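A sketch of this equivalence: running SGD with $\eta = 1$ on $\max(0, -y\langle w, x\rangle)$ reproduces the Perceptron update (the epoch loop over the dataset is my own illustration):

```python
def perceptron_grad(w, y, x):
    """(Sub)gradient of max(0, -y<w, x>): -y*x on a mistake, else 0."""
    return -y * x if y * np.dot(w, x) < 0 else np.zeros_like(w)

def perceptron(w0, X, Y, n_epochs=10):
    """SGD with eta = 1 on the loss above: exactly the Perceptron update
    w <- w + y_i x_i whenever sample i is misclassified."""
    w = w0
    for _ in range(n_epochs):
        for x, y in zip(X, Y):
            w = w - 1.0 * perceptron_grad(w, y, x)
    return w
```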
Questions?
