Gradient Descent
Disclaimer:
This PPT is modified from Hung-yi Lee's slides:
http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17.html
Review: Gradient Descent
• In step 3, we have to solve the following optimization problem:
$\theta^* = \arg\min_{\theta} L(\theta)$    ($L$: loss function, $\theta$: parameters)
Background on gradients: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/the-gradient
Review: Gradient Descent
Gradient: the vector of derivatives of the loss function with respect to the parameters; gradient descent moves in the direction of the negative gradient.
• Start at position $\theta^0$
• Compute the gradient at $\theta^0$, then move to $\theta^1 = \theta^0 - \eta \nabla L(\theta^0)$
• Compute the gradient at $\theta^1$, then move to $\theta^2 = \theta^1 - \eta \nabla L(\theta^1)$
• ……
(Figure: the trajectory $\theta^0 \to \theta^1 \to \theta^2 \to \theta^3$ in the $(\theta_1, \theta_2)$ plane; at each point the gradient $\nabla L(\theta^t)$ is drawn, and the movement is the negative gradient scaled by $\eta$.)
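A minimal sketch of this update loop, assuming a toy two-parameter quadratic loss; the loss, starting point, and learning rate below are illustrative choices, not values from the slides:

```python
import numpy as np

# Toy loss L(theta) = (theta_1 - 3)^2 + 10 * (theta_2 + 1)^2  (illustrative only)
def loss(theta):
    return (theta[0] - 3.0) ** 2 + 10.0 * (theta[1] + 1.0) ** 2

def grad(theta):
    # Analytic gradient of the toy loss
    return np.array([2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)])

eta = 0.05                       # learning rate
theta = np.array([0.0, 0.0])     # theta^0: starting position
for t in range(100):
    theta = theta - eta * grad(theta)   # theta^{t+1} = theta^t - eta * grad L(theta^t)

print(theta, loss(theta))        # approaches (3, -1), where the toy loss is smallest
```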
Gradient Descent
(Figure: loss versus number of parameter updates for different learning rates. A learning rate that is too small lowers the loss very slowly; one that is too large makes the loss oscillate or even blow up; a just-right rate drives the loss down steadily.)
You cannot visualize the loss surface itself when there are many parameters, but you can always visualize this loss-versus-updates curve.
Adaptive Learning Rates
• Popular & Simple Idea: Reduce the learning rate by
some factor every few epochs.
• At the beginning, we are far from the destination, so we use a larger learning rate
• After several epochs, we are close to the destination, so we reduce the learning rate
• E.g. 1/t decay: $\eta^t = \eta / \sqrt{t+1}$ (a minimal sketch of this schedule follows this list)
• The learning rate cannot be one-size-fits-all
• Give different parameters different learning rates
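A minimal sketch of the 1/t decay schedule from the list above; the base learning rate is an arbitrary illustrative value:

```python
import math

eta = 0.1   # base learning rate (illustrative)

def eta_t(t):
    # 1/t decay: eta^t = eta / sqrt(t + 1)
    return eta / math.sqrt(t + 1)

print([round(eta_t(t), 4) for t in range(5)])
# [0.1, 0.0707, 0.0577, 0.05, 0.0447]  -- the rate shrinks as training proceeds
```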
Adagrad: $\eta^t = \dfrac{\eta}{\sqrt{t+1}}$,  $g^t = \dfrac{\partial L(\theta^t)}{\partial w}$
Adagrad
$w^{t+1} \leftarrow w^t - \dfrac{\eta^t}{\sigma^t}\, g^t$,  where $\sigma^t$ is the root mean square of the previous derivatives of parameter $w$.
For example,
$w^3 \leftarrow w^2 - \dfrac{\eta^2}{\sigma^2}\, g^2$,  $\sigma^2 = \sqrt{\dfrac{1}{3}\left[(g^0)^2 + (g^1)^2 + (g^2)^2\right]}$
……
$w^{t+1} \leftarrow w^t - \dfrac{\eta^t}{\sigma^t}\, g^t$,  $\sigma^t = \sqrt{\dfrac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$
Adagrad
• Divide the learning rate of each parameter by the
root mean square of its previous derivatives
With the 1/t decay $\eta^t = \dfrac{\eta}{\sqrt{t+1}}$ and $\sigma^t = \sqrt{\dfrac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$, the $\sqrt{t+1}$ factors cancel, and the update
$w^{t+1} \leftarrow w^t - \dfrac{\eta^t}{\sigma^t}\, g^t$  simplifies to  $w^{t+1} \leftarrow w^t - \dfrac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$
Contradiction? With $\eta^t = \dfrac{\eta}{\sqrt{t+1}}$ and $g^t = \dfrac{\partial L(\theta^t)}{\partial w}$: a larger gradient $g^t$ asks for a larger step, yet a larger accumulated $\sqrt{\sum_i (g^i)^2}$ in the denominator makes the step smaller.
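A minimal sketch of the simplified Adagrad update above for a single parameter w; the toy loss, its gradient, and the hyperparameters are illustrative, and the small epsilon is a common numerical safeguard that is not in the slides:

```python
import math

def grad(w):
    # Gradient of an illustrative toy loss L(w) = (w - 2)^2
    return 2.0 * (w - 2.0)

eta = 1.0        # base learning rate
w = 0.0          # initial parameter
sum_g2 = 0.0     # accumulated sum of squared past gradients: sum_i (g^i)^2
eps = 1e-8       # avoids division by zero on the first step (not in the slides)

for t in range(100):
    g = grad(w)
    sum_g2 += g ** 2
    # w^{t+1} = w^t - eta / sqrt(sum_i (g^i)^2) * g^t
    w = w - eta / (math.sqrt(sum_g2) + eps) * g

print(w)   # approaches 2, the minimizer of the toy loss
```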
Tip 2: Stochastic Gradient Descent
Make the training faster
Stochastic Gradient Descent
Pick an example $x^n$ and compute the loss for only that example, with no summing over the training set:
$L^n = \left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2$
Update: $\theta^i = \theta^{i-1} - \eta \nabla L^n(\theta^{i-1})$
Stochastic Gradient Descent
• Gradient descent: update the parameters once after seeing all examples.
• Stochastic gradient descent: update for each example, so after one pass over the data the parameters have been updated once per example. If there are 20 examples, that is 20 updates in the time gradient descent makes one, i.e. 20 times faster.
(Figure: contour plots comparing one smooth gradient-descent step per pass with many noisier stochastic steps.)
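A minimal sketch of per-example (stochastic) updates for the linear model $y = b + w x$, assuming a tiny synthetic dataset; all data and hyperparameters below are illustrative:

```python
import numpy as np

# Tiny synthetic dataset generated from y = 1 + 2*x  (illustrative)
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = 1.0 + 2.0 * xs

def grad_one(w, b, x, y_hat):
    # Gradient of the single-example loss L^n = (y_hat - (b + w*x))^2
    err = y_hat - (b + w * x)
    return -2.0 * err * x, -2.0 * err      # dL^n/dw, dL^n/db

eta = 0.01
w, b = 0.0, 0.0

# Stochastic gradient descent: one update per example
for epoch in range(200):
    for x, y_hat in zip(xs, ys):
        gw, gb = grad_one(w, b, x, y_hat)
        w, b = w - eta * gw, b - eta * gb

print(w, b)   # approaches the generating parameters (2, 1)
```

Plain gradient descent would instead sum the per-example gradients over the whole dataset and make a single update per pass.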
Gradient Descent
$y = b + w_1 x_1 + w_2 x_2$
(Figure: two contour plots of the loss $L$ over $(w_1, w_2)$. Left: $x_1$ takes values like 1, 2, … while $x_2$ takes values like 100, 200, …, so the contours are elongated ellipses. Right: after rescaling, both inputs take values like 1, 2, …, and the contours are much closer to circles, so the updates head more directly toward the minimum.)
Feature Scaling (normalization with sample size R)
Given examples $x^1, x^2, x^3, \dots, x^r, \dots, x^R$, for each dimension $i$ compute the mean $m_i$ and the standard deviation $\sigma_i$ over the $R$ examples, then normalize:
$x_i^r \leftarrow \dfrac{x_i^r - m_i}{\sigma_i}$
After scaling, the means of all dimensions are 0 and the variances are all 1.
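A minimal sketch of this per-dimension normalization with NumPy; the data matrix is illustrative, with rows as the R examples and columns as the dimensions:

```python
import numpy as np

# R = 4 examples, 2 dimensions; the second dimension has a much larger scale (illustrative)
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

m = X.mean(axis=0)           # m_i: mean of each dimension over the R examples
sigma = X.std(axis=0)        # sigma_i: standard deviation of each dimension

X_scaled = (X - m) / sigma   # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))   # ~[0, 0]: every dimension now has mean 0
print(X_scaled.var(axis=0))    # ~[1, 1]: every dimension now has variance 1
```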
Warning of Math
Gradient Descent
Theory
Question
• When solving $\theta^* = \arg\min_{\theta} L(\theta)$ by gradient descent:
(Figure: loss contours labelled 0, 1, 2 around the current point, with a small circle drawn around it.)
Given a point, we can easily find the point with the smallest value nearby. How?
Taylor Series
• Taylor series: let h(x) be any function infinitely differentiable around $x = x_0$.
$h(x) = \sum_{k=0}^{\infty} \dfrac{h^{(k)}(x_0)}{k!}(x - x_0)^k = h(x_0) + h'(x_0)(x - x_0) + \dfrac{h''(x_0)}{2!}(x - x_0)^2 + \cdots$
(Figure: $\sin(x)$ and its Taylor approximations around $x_0 = \pi/4$; the approximation is good around $\pi/4$.)
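A small numeric check of the first-order Taylor approximation of $h(x) = \sin(x)$ around $x_0 = \pi/4$; the evaluation points are arbitrary:

```python
import math

x0 = math.pi / 4

def approx(x):
    # First-order Taylor approximation: sin(x0) + cos(x0) * (x - x0)
    return math.sin(x0) + math.cos(x0) * (x - x0)

for x in (x0 + 0.01, x0 + 0.5, x0 + 2.0):
    print(round(x, 3), round(math.sin(x), 4), round(approx(x), 4))
# Near pi/4 the two values almost coincide; far from pi/4 they drift apart.
```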
Multivariable Taylor Series
$h(x, y) = h(x_0, y_0) + \dfrac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \dfrac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$ + something related to $(x - x_0)^2$ and $(y - y_0)^2$ + ……
When $x$ and $y$ are close to $x_0$ and $y_0$:
$h(x, y) \approx h(x_0, y_0) + \dfrac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \dfrac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$
Back to Formal Derivation
Based on the Taylor series: if the red circle is small enough, inside the red circle
$L(\theta) \approx L(a, b) + \dfrac{\partial L(a, b)}{\partial \theta_1}(\theta_1 - a) + \dfrac{\partial L(a, b)}{\partial \theta_2}(\theta_2 - b)$
Let $s = L(a, b)$, $u = \dfrac{\partial L(a, b)}{\partial \theta_1}$, $v = \dfrac{\partial L(a, b)}{\partial \theta_2}$, so that
$L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$
(Figure: a small red circle centred at $(a, b)$ in the $(\theta_1, \theta_2)$ plane.)
Back to Formal Derivation
Based on the Taylor series: if the red circle is small enough, inside the red circle
$L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$,  with constants $s = L(a, b)$, $u = \dfrac{\partial L(a, b)}{\partial \theta_1}$, $v = \dfrac{\partial L(a, b)}{\partial \theta_2}$
Find $\theta_1$ and $\theta_2$ in the red circle minimizing $L(\theta)$, where the red circle is
$(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$
Simple, right?
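A small numeric check of this linear approximation, using an illustrative loss $L(\theta_1, \theta_2) = \theta_1^2 + 2\theta_2^2$ around the point $(a, b) = (1, 1)$; both the loss and the point are assumptions for illustration only:

```python
def L(t1, t2):
    # Illustrative loss
    return t1 ** 2 + 2.0 * t2 ** 2

a, b = 1.0, 1.0
s = L(a, b)        # s = L(a, b)
u = 2.0 * a        # u = dL/d(theta_1) at (a, b)
v = 4.0 * b        # v = dL/d(theta_2) at (a, b)

def L_approx(t1, t2):
    # L(theta) ~ s + u*(theta_1 - a) + v*(theta_2 - b)
    return s + u * (t1 - a) + v * (t2 - b)

for r in (0.01, 0.1, 1.0):        # rough radius of the "red circle"
    t1, t2 = a - r, b - r
    print(r, L(t1, t2), L_approx(t1, t2))
# For a small radius the approximation is accurate; for a large one it is not.
```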
Gradient descent – two variables
Red circle (if the radius $d$ is small): $(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$
To minimize $L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$ inside the circle, move from $(a, b)$ in the direction opposite to $(u, v)$:
$\theta_1 = a - \eta u = a - \eta \dfrac{\partial L(a, b)}{\partial \theta_1}$,  $\theta_2 = b - \eta v = b - \eta \dfrac{\partial L(a, b)}{\partial \theta_2}$
This is exactly the gradient descent update, and it is only guaranteed to work when the red circle (the learning rate) is small enough.
Note: the update can also stall wherever $\partial L / \partial w \approx 0$ (a plateau) or stop where $\partial L / \partial w = 0$ (a saddle point or a local minimum), not only at the global minimum.
Comparison between different parameters
A larger first-order derivative means farther from the minimum, but only within the same parameter:
• For $w_1$: the derivative at a is larger than at b, and a is farther from the minimum.
• For $w_2$: the derivative at c is larger than at d, and c is farther from the minimum.
• Do not compare across parameters: $w_1$ and $w_2$ can have very different curvatures, so the size of the first derivative alone does not tell you which point is farther from its minimum.
Second Derivative
For $y = ax^2 + bx + c$, the minimum is at $x = -\dfrac{b}{2a}$. Starting from $x_0$, the best step is
$\left| x_0 + \dfrac{b}{2a} \right| = \dfrac{|2ax_0 + b|}{2a}$
The numerator is the first derivative at $x_0$, $\left|\dfrac{\partial y}{\partial x}\right|_{x = x_0} = |2ax_0 + b|$, and the denominator is the second derivative, $2a$; so the best step is
$\dfrac{|\text{first derivative}|}{\text{second derivative}}$
This is why, when comparing $w_1$ and $w_2$, the best step size has to take the second derivative into account as well.
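A small numeric check that the best step $|2ax_0 + b| / 2a$, i.e. |first derivative| / second derivative, lands exactly on the minimum of a quadratic; the coefficients and starting point are illustrative:

```python
# Illustrative quadratic y = a*x^2 + b*x + c
a, b, c = 2.0, -8.0, 3.0
x0 = 5.0                        # starting point, to the right of the minimum

first = abs(2.0 * a * x0 + b)   # |dy/dx| at x0 = |2*a*x0 + b|
second = 2.0 * a                # d^2y/dx^2 = 2a
best_step = first / second      # |first derivative| / second derivative

x_min = -b / (2.0 * a)          # true minimizer of the quadratic
print(x0 - best_step, x_min)    # both print 2.0: one step (taken downhill) reaches the minimum
```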