Gradient Descent (v2)
Gradient Descent
Tip 1: Tuning your learning rates
$\theta^i = \theta^{i-1} - \eta \nabla L(\theta^{i-1})$
[Figure: loss vs. number of parameter updates for learning rates that are too small, too large, and just right.]
Even when the loss surface itself cannot be plotted, you can always visualize this curve of loss against the number of parameter updates.
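To make the tip concrete, here is a minimal sketch (not the lecture demo; the toy loss and learning rates are arbitrary choices) that records the loss after every update for several learning rates, producing exactly the kind of curves described above:

    def loss(w):
        return (w - 3.0) ** 2              # toy 1-D loss with its minimum at w = 3

    def grad(w):
        return 2.0 * (w - 3.0)             # derivative of the toy loss

    histories = {}
    for eta in (0.001, 0.1, 1.1):          # too small, roughly right, too large
        w, curve = 0.0, []
        for _ in range(50):
            w -= eta * grad(w)             # gradient descent update
            curve.append(loss(w))
        histories[eta] = curve             # plot each curve: loss vs. update count

With eta = 0.001 the loss decreases very slowly, with eta = 0.1 it converges, and with eta = 1.1 it blows up.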
Adaptive Learning Rates
• Popular & simple idea: reduce the learning rate by some factor every few epochs.
• At the beginning, we are far from the destination, so we use a larger learning rate.
• After several epochs, we are close to the destination, so we reduce the learning rate.
• E.g. 1/t decay: $\eta^t = \eta / \sqrt{t+1}$ (see the sketch after this list)
• The learning rate cannot be one-size-fits-all.
• Give different parameters different learning rates.
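A minimal sketch of the 1/t decay schedule from the bullet above (the base rate 0.1 is an arbitrary example):

    def decayed_lr(eta0, t):
        # 1/t decay: eta^t = eta0 / sqrt(t + 1)
        return eta0 / (t + 1) ** 0.5

    # eta0 = 0.1 gives 0.1, 0.0707, 0.0577, 0.05, ... for t = 0, 1, 2, 3, ...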
Adagrad
$w^{t+1} \leftarrow w^t - \dfrac{\eta^t}{\sigma^t}\, g^t$, where $\eta^t = \dfrac{\eta}{\sqrt{t+1}}$ and $g^t = \dfrac{\partial L(\theta^t)}{\partial w}$
$\sigma^t$: root mean square of the previous derivatives of parameter $w$ (parameter dependent)
Adagrad
$\sigma^t$: root mean square of the previous derivatives of parameter $w$
$w^1 \leftarrow w^0 - \dfrac{\eta^0}{\sigma^0} g^0 \qquad \sigma^0 = \sqrt{(g^0)^2}$
$w^2 \leftarrow w^1 - \dfrac{\eta^1}{\sigma^1} g^1 \qquad \sigma^1 = \sqrt{\tfrac{1}{2}\big[(g^0)^2 + (g^1)^2\big]}$
$w^3 \leftarrow w^2 - \dfrac{\eta^2}{\sigma^2} g^2 \qquad \sigma^2 = \sqrt{\tfrac{1}{3}\big[(g^0)^2 + (g^1)^2 + (g^2)^2\big]}$
……
$w^{t+1} \leftarrow w^t - \dfrac{\eta^t}{\sigma^t} g^t \qquad \sigma^t = \sqrt{\tfrac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$
Adagrad
• Divide the learning rate of each parameter by the root mean square of its previous derivatives.
With 1/t decay $\eta^t = \dfrac{\eta}{\sqrt{t+1}}$ and $\sigma^t = \sqrt{\dfrac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$, the update $w^{t+1} \leftarrow w^t - \dfrac{\eta^t}{\sigma^t} g^t$ simplifies, because the $\sqrt{t+1}$ factors cancel:
$w^{t+1} \leftarrow w^t - \dfrac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$ (a code sketch follows)
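The sketch below implements this simplified update for a single parameter on a toy loss; the loss, learning rate, and the small eps added to avoid division by zero are my assumptions, not part of the slides:

    def adagrad_step(w, g, sum_sq, eta=0.1, eps=1e-8):
        # Accumulate the squared gradients seen so far: sum_{i=0}^{t} (g^i)^2
        sum_sq += g ** 2
        # w^{t+1} <- w^t - eta / sqrt(sum of squared gradients) * g^t
        w -= eta / (sum_sq ** 0.5 + eps) * g
        return w, sum_sq

    w, sum_sq = 0.0, 0.0
    for _ in range(100):
        g = 2.0 * (w - 3.0)                # gradient of the toy loss (w - 3)^2
        w, sum_sq = adagrad_step(w, g, sum_sq)
    # w crawls toward the minimum at w = 3 with automatically shrinking steps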
Contradiction?
With $g^t = \dfrac{\partial L(\theta^t)}{\partial w}$, vanilla gradient descent takes a larger step when the gradient is larger. In Adagrad, a larger gradient also enlarges the denominator $\sqrt{\sum_{i=0}^{t}(g^i)^2}$, which shrinks the step. An apparent contradiction.
• Intuitive reason: how surprising the current gradient is (the contrast with the previous ones).
$g^0$ = 0.001, $g^1$ = 0.001, $g^2$ = 0.003, $g^3$ = 0.002, $g^4$ = 0.1, …… ($g^4$ is especially large compared with its predecessors)
$g^0$ = 10.8, $g^1$ = 20.9, $g^2$ = 31.7, $g^3$ = 12.1, $g^4$ = 0.1, …… ($g^4$ is especially small compared with its predecessors)
$w^{t+1} \leftarrow w^t - \dfrac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$ — the denominator is what creates this contrast effect.
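The hypothetical snippet below evaluates the Adagrad step $\dfrac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}} g^t$ for the two gradient sequences above (taking $\eta = 1$ purely for illustration), showing how the denominator amplifies a surprisingly large gradient and damps a surprisingly small one:

    def adagrad_steps(grads, eta=1.0):
        steps, sum_sq = [], 0.0
        for g in grads:
            sum_sq += g ** 2
            steps.append(eta * g / sum_sq ** 0.5)   # step taken at this update
        return steps

    print(adagrad_steps([0.001, 0.001, 0.003, 0.002, 0.1]))  # last step jumps up (~1.0 vs. ~0.5 before)
    print(adagrad_steps([10.8, 20.9, 31.7, 12.1, 0.1]))      # last step collapses (~0.002)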
Larger gradient, larger steps?
For $y = ax^2 + bx + c$, the minimum lies at $x = -\dfrac{b}{2a}$. Starting from $x_0$, the best step is
$\left| x_0 + \dfrac{b}{2a} \right| = \dfrac{|2ax_0 + b|}{2a}$
and the magnitude of the first derivative is $\left|\dfrac{\partial y}{\partial x}\right| = |2ax + b|$, so a larger first-order derivative means $x_0$ is farther from the minimum.
Comparison between different parameters
"Larger first-order derivative means farther from the minimum" holds within a single parameter — do not compare across parameters.
[Figure: cross-sections of the loss along $w_1$ (points a, b with a > b) and along $w_2$ (points c, d with c > d); the rule holds inside each cross-section, but not when comparing a point on $w_1$ with a point on $w_2$.]
Second Derivative
Best step for $y = ax^2 + bx + c$ starting from $x_0$:
$\left| x_0 + \dfrac{b}{2a} \right| = \dfrac{|2ax_0 + b|}{2a}$
The numerator $|2ax_0 + b|$ is the magnitude of the first derivative $\dfrac{\partial y}{\partial x} = 2ax + b$, and the denominator $2a$ is the second derivative. So the best step is |first derivative| divided by the second derivative.
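A quick numerical check of the "best step" formula, on a quadratic chosen only for illustration:
\[
y = 2x^2 + 4x + 1,\quad x_0 = 3:\qquad
\frac{|2ax_0 + b|}{2a} = \frac{|2\cdot 2\cdot 3 + 4|}{2\cdot 2} = \frac{16}{4} = 4,
\]
which is exactly the distance from $x_0 = 3$ to the minimum at $-b/2a = -1$, and equals the first derivative magnitude at $x_0$ (namely 16) divided by the second derivative (namely $2a = 4$).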
Gradient Descent
Tip 2: Stochastic Gradient Descent
Make the training faster
Stochastic Gradient Descent
• Pick one example $x^n$
• Loss for only that one example: $L^n = \left( \hat{y}^n - \big( b + \sum_i w_i x_i^n \big) \right)^2$
• Update: $\theta^i = \theta^{i-1} - \eta \nabla L^n(\theta^{i-1})$
• Demo
Stochastic Gradient Descent
• Gradient descent: update once after seeing all examples.
• Stochastic gradient descent: update for each example. If there are 20 examples, that is 20 updates per pass through the data — 20 times faster (see the sketch below).
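A minimal sketch contrasting the two schemes on the linear model $y = b + \sum_i w_i x_i$; the synthetic data, step size, and the absorption of the constant factor 2 into $\eta$ are my choices, not the lecture demo:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                  # 20 examples, 3 features
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.3      # targets from a known linear model
    w, b, eta = np.zeros(3), 0.0, 0.05

    # (Batch) gradient descent: one update after seeing all 20 examples.
    err = y - (X @ w + b)
    w = w + eta * X.T @ err / len(X)              # gradient averaged over all examples
    b = b + eta * err.mean()

    # Stochastic gradient descent: one update per example, so 20 updates per pass.
    for xn, yn in zip(X, y):
        err_n = yn - (xn @ w + b)                 # error on this single example
        w = w + eta * err_n * xn                  # gradient of L^n (factor 2 absorbed into eta)
        b = b + eta * err_n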
Gradient Descent
Tip 3: Feature Scaling
$y = b + w_1 x_1 + w_2 x_2$
[Figure: loss contours over $w_1, w_2$. When $x_1$ takes values 1, 2, …… while $x_2$ takes values 100, 200, ……, the contours are elongated; after scaling both features to a comparable range, the contours are close to circles and gradient descent heads more directly toward the minimum.]
Feature Scaling
Given examples $x^1, x^2, x^3, \ldots, x^r, \ldots, x^R$, compute for each dimension $i$ the mean $m_i$ and the standard deviation $\sigma_i$ over the $R$ examples, then normalize every component:
$x_i^r \leftarrow \dfrac{x_i^r - m_i}{\sigma_i}$
After this, the mean of every dimension is 0 and the variance of every dimension is 1.
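A minimal NumPy sketch of this standardization, assuming the $R$ examples are stacked as the rows of a matrix:

    import numpy as np

    X = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])        # R = 3 examples, 2 dimensions

    m = X.mean(axis=0)                  # mean m_i of each dimension
    sigma = X.std(axis=0)               # standard deviation sigma_i of each dimension
    X_scaled = (X - m) / sigma          # x_i^r <- (x_i^r - m_i) / sigma_i

    # Every column of X_scaled now has mean 0 and variance 1.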
Gradient Descent
Theory
Question
• When solving $\theta^* = \arg\min_\theta L(\theta)$ by gradient descent:
[Figure: contour plot of $L(\theta)$ over $\theta_1, \theta_2$ with a small circle drawn around the current point.]
Given a point, we can easily find the point with the smallest value nearby. How?
Taylor Series
• Taylor series: let $h(x)$ be any function infinitely differentiable around $x = x_0$.
$h(x) = \sum_{k=0}^{\infty} \dfrac{h^{(k)}(x_0)}{k!}(x - x_0)^k = h(x_0) + h'(x_0)(x - x_0) + \dfrac{h''(x_0)}{2!}(x - x_0)^2 + \cdots$
• E.g. expanding $\sin(x)$ around $x_0 = \pi/4$: the approximation is good around $\pi/4$.
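Keeping only the constant and linear terms of the $\sin(x)$ expansion around $x_0 = \pi/4$ gives (standard calculus, written out for concreteness):
\[
\sin(x) \approx \sin\!\Big(\frac{\pi}{4}\Big) + \cos\!\Big(\frac{\pi}{4}\Big)\Big(x - \frac{\pi}{4}\Big)
= \frac{\sqrt{2}}{2} + \frac{\sqrt{2}}{2}\Big(x - \frac{\pi}{4}\Big),
\]
which is accurate only when $x$ stays near $\pi/4$.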
Multivariable Taylor Series
$h(x, y) = h(x_0, y_0) + \dfrac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \dfrac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$ + something related to $(x - x_0)^2$ and $(y - y_0)^2$ + ……
When $x$ and $y$ are close to $x_0$ and $y_0$:
$h(x, y) \approx h(x_0, y_0) + \dfrac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \dfrac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$
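As a tiny concrete case (the function is chosen only for illustration), take $h(x, y) = x^2 + y^2$ around $(x_0, y_0) = (1, 1)$:
\[
h(x, y) \approx h(1, 1) + \frac{\partial h}{\partial x}\Big|_{(1,1)}(x - 1) + \frac{\partial h}{\partial y}\Big|_{(1,1)}(y - 1)
= 2 + 2(x - 1) + 2(y - 1),
\]
since $\partial h/\partial x = 2x$ and $\partial h/\partial y = 2y$ both equal 2 at $(1, 1)$.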
Back to Formal Derivation
Based on the Taylor series: if the red circle around the current point $(a, b)$ is small enough, then within the red circle
$L(\theta) \approx L(a, b) + \dfrac{\partial L(a, b)}{\partial \theta_1}(\theta_1 - a) + \dfrac{\partial L(a, b)}{\partial \theta_2}(\theta_2 - b)$
Let $s = L(a, b)$, $u = \dfrac{\partial L(a, b)}{\partial \theta_1}$, $v = \dfrac{\partial L(a, b)}{\partial \theta_2}$, so that
$L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$
Back to Formal Derivation
Within the red circle, $L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$, where $s = L(a, b)$ is a constant and $u = \dfrac{\partial L(a, b)}{\partial \theta_1}$, $v = \dfrac{\partial L(a, b)}{\partial \theta_2}$.
Find $\theta_1$ and $\theta_2$ within the red circle that minimize $L(\theta)$, subject to
$(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$
where $d$ is the radius of the circle centred at $(a, b)$. Simple, right?
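Completing the step (a standard argument, spelled out here): $s$ is a constant, so minimizing $L(\theta)$ over the circle amounts to minimizing the inner product $u(\theta_1 - a) + v(\theta_2 - b)$, which is smallest when $(\theta_1 - a, \theta_2 - b)$ points opposite to $(u, v)$ and reaches the boundary of the circle:
\[
\begin{bmatrix} \theta_1 - a \\ \theta_2 - b \end{bmatrix} = -\eta \begin{bmatrix} u \\ v \end{bmatrix}
\quad\Longrightarrow\quad
\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix}
- \eta \begin{bmatrix} \partial L(a, b)/\partial \theta_1 \\ \partial L(a, b)/\partial \theta_2 \end{bmatrix},
\]
which is exactly the gradient descent update, with $\eta$ chosen so the step stays within the red circle.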
Gradient descent – two variables
Red circle: the linear approximation, and hence this update, is valid only if the radius of the circle (and therefore the learning rate) is small.
[Figure: places on the loss surface where $\partial L / \partial w \approx 0$ or $\partial L / \partial w = 0$ (e.g. plateaus, saddle points, local minima), where gradient descent makes little or no progress.]