4-Optimization of 2 Variables, Gradient Descent
• Problem: Minimize $f(x)$, where $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$.
• Solution:
  Find the stationary points $x^*$ by solving $\nabla f(x) = 0$.
  At each stationary point, evaluate the Hessian matrix $H(x^*)$, which is the matrix
  containing all second-order partial derivatives of $f$ w.r.t. $x_1, x_2, \ldots, x_n$:
  $H(x^*) = \begin{pmatrix}
    \frac{\partial^2 f}{\partial x_1^2}(x^*) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(x^*) \\
    \vdots & \ddots & \vdots \\
    \frac{\partial^2 f}{\partial x_n \partial x_1}(x^*) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x^*)
  \end{pmatrix}_{n \times n}$
  H(x*)                Conclusion
  Positive definite    Minimum
  Negative definite    Maximum
  Indefinite           Saddle point
  Semidefinite         No conclusion
Recall:
A square matrix is said to be
• positive definite if all its eigenvalues are positive (> 0),
• negative definite if all its eigenvalues are negative (< 0),
• positive semidefinite if all its eigenvalues are ≥ 0,
• negative semidefinite if all its eigenvalues are ≤ 0,
• indefinite if it has both positive and negative eigenvalues.
A quick numerical check of these criteria is sketched below.
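A minimal sketch of such a check, assuming NumPy is available; the helper `classify` and the tolerance `tol` are illustrative names, not from the slides. It inspects the signs of the eigenvalues of a symmetric matrix (such as a Hessian).

```python
# Classify a symmetric matrix by the signs of its eigenvalues (illustrative sketch).
import numpy as np

def classify(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric (Hermitian) matrix
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig < -tol):
        return "negative definite"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "indefinite"
    return "semidefinite"          # at least one (near-)zero eigenvalue

print(classify(np.array([[0.0, 3.0], [3.0, 0.0]])))     # indefinite
print(classify(np.array([[-6.0, 3.0], [3.0, -6.0]])))   # negative definite
```

The two matrices used in the check are the Hessians that appear in the worked example below.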
Solve
• Example: $f(x, y) = x^3 + y^3 + 3xy$
• Hints: compute $\nabla f = \begin{pmatrix} \partial f/\partial x \\ \partial f/\partial y \end{pmatrix}$ and $H = \begin{pmatrix} \partial^2 f/\partial x^2 & \partial^2 f/\partial x\,\partial y \\ \partial^2 f/\partial y\,\partial x & \partial^2 f/\partial y^2 \end{pmatrix}$,
  then solve $\nabla f = 0$ to find the stationary points.
• Example: $f(x, y) = x^3 + y^3 + 3xy$
• Solution: $\nabla f = \begin{pmatrix} 3x^2 + 3y \\ 3y^2 + 3x \end{pmatrix}$ and $H = \begin{pmatrix} 6x & 3 \\ 3 & 6y \end{pmatrix}$
  $\nabla f = 0 \;\Rightarrow\; \begin{pmatrix} 3x^2 + 3y \\ 3y^2 + 3x \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \;\Rightarrow\; x^2 = -y \text{ and } y^2 = -x$
  Solving simultaneously, we get $(0, 0)$ and $(-1, -1)$ as the stationary points.
  Now, to check for maxima and minima, we evaluate the Hessian matrix at these points.
  $H(0, 0) = \begin{pmatrix} 0 & 3 \\ 3 & 0 \end{pmatrix}$: the eigenvalues of this matrix are $3$ and $-3$ (check), i.e. $H$ is indefinite, which implies that $(0, 0)$ is a saddle point.
  $H(-1, -1) = \begin{pmatrix} -6 & 3 \\ 3 & -6 \end{pmatrix}$: the eigenvalues of this matrix are $-9$ and $-3$ (check), i.e. $H$ is negative definite, which implies that $(-1, -1)$ is a point of (local) maximum.
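As an independent check of this example (not part of the slides), the stationary points and Hessian eigenvalues can be reproduced symbolically, assuming SymPy is available:

```python
# Symbolic check of the worked example f(x, y) = x^3 + y^3 + 3xy.
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**3 + y**3 + 3*x*y

grad = [sp.diff(f, v) for v in (x, y)]   # gradient components
H = sp.hessian(f, (x, y))                # symbolic Hessian matrix

for point in sp.solve(grad, (x, y), dict=True):
    print(point, H.subs(point).eigenvals())
# Expected (up to ordering): {x: 0, y: 0} with eigenvalues {3, -3}  -> indefinite (saddle point)
#                            {x: -1, y: -1} with eigenvalues {-3, -9} -> negative definite (maximum)
```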
i.e. $\min_x \|Ax - b\|_p$
For $p = 2$, the problem is called ($L_2$) least squares, and the term $\|Ax - b\|_2^2$ is called the residual sum of squares (RSS).
• Example: find the least squares solution of $Ax = b$, where
  $A = \begin{pmatrix} 2 & 0 \\ -1 & 1 \\ 0 & 2 \end{pmatrix}$ and $b = \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}$
• Solution:
  Observe that the system is inconsistent, so we find the least squares solution.
  The objective function is
  $\|Ax - b\|^2 = (2x - 1)^2 + (-x + y)^2 + (2y + 1)^2$
  Finding the minimum by the method seen earlier, we get the solution $x = \frac{1}{3}$, $y = -\frac{1}{3}$ (check).
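The slides find this minimum by setting the gradient to zero; as an independent check (assuming NumPy is available), the same answer comes out of NumPy's built-in least squares routine:

```python
# Least squares solution of the inconsistent system A x = b above.
import numpy as np

A = np.array([[2.0, 0.0], [-1.0, 1.0], [0.0, 2.0]])
b = np.array([1.0, 0.0, -1.0])

sol, rss, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(sol)   # approximately [ 0.3333, -0.3333], i.e. x = 1/3, y = -1/3
print(rss)   # residual sum of squares ||Ax - b||^2, approximately 0.6667
```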
Algorithmic approach:
• In practice, computing and storing the full Hessian matrix requires a large amount of
  memory, which is infeasible for high-dimensional functions such as loss functions with
  large numbers of parameters. For such situations, first-order algorithmic methods like
  gradient descent and second-order methods like Newton's method have been developed.
Gradient Descent:
• An algorithmic method for finding a local minimum of a differentiable function.
• The algorithm is initiated by choosing random values for the parameters.
• The parameters are then improved gradually by taking steps proportional to the negative
  of the gradient (or approximate gradient) of the cost function at the current point.
• The process continues until the algorithm converges to a minimum, i.e. until the
  difference between successive iterates becomes stable or falls below a threshold
  (a minimal sketch of this loop is given below).
Note: If we instead take steps proportional to the positive of the gradient, we approach
a local maximum of that function; the procedure is then known as gradient ascent.
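A minimal sketch of this loop (illustrative only, assuming NumPy; the function name `gradient_descent`, the tolerance `tol`, and the iteration cap `max_iter` are my own choices). The stopping rule is the stability of successive iterates described above.

```python
# Generic gradient descent: step against the gradient until iterates stop changing much.
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, tol=1e-8, max_iter=10_000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        new_theta = theta - alpha * grad(theta)       # step proportional to -gradient
        if np.linalg.norm(new_theta - theta) < tol:   # successive iterates are stable
            return new_theta
        theta = new_theta
    return theta

# Example: minimize f(x, y) = x^2 + y^2, whose gradient is (2x, 2y).
print(gradient_descent(lambda t: 2 * t, theta0=[5.0, -3.0]))   # approaches [0, 0]
```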
Analogy:
• To get an idea of how Gradient Descent works, let us consider an analogy.
Analogy:
• Suppose you are somewhere on a mountainside and want to get down to the base camp.
  One of the ways is to use your feet to feel where the ground slopes downward most
  steeply; that is the direction in which you should take your first step. If you keep
  repeating this, step by step, you eventually reach the base camp.
Limitation:
• If there is a slight rise in the ground while you are going downhill, you may stop there
  thinking you have reached the lowest point, even though the base camp lies further down;
  in the same way, gradient descent can get stuck in a local minimum instead of reaching
  the global minimum.
• The cost function is denoted by $J(\theta)$, where $\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{pmatrix}$, $\theta_i$ is the $i^{th}$ parameter, and the learning rate is denoted by $\alpha$,
• so that the iterative formula becomes $\theta := \theta - \alpha \nabla J(\theta)$. If we
  apply this formula individually to the components of $\theta$, then the formula for the
  $j^{th}$ component $\theta_j$ becomes
• $\theta_j := \theta_j - \alpha\,\dfrac{\partial J(\theta)}{\partial \theta_j}$ (a component-wise sketch is given below).
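A sketch of this component-wise rule, using a finite-difference approximation of each partial derivative (one way to form the "approximate gradient" mentioned earlier); the cost function `J` below is only an illustration, not from the slides.

```python
# Component-wise update theta_j := theta_j - alpha * dJ/dtheta_j,
# with each partial derivative approximated by a forward difference.
def partial(J, theta, j, h=1e-6):
    bumped = theta.copy()
    bumped[j] += h
    return (J(bumped) - J(theta)) / h

def step(J, theta, alpha=0.1):
    return [theta[j] - alpha * partial(J, theta, j) for j in range(len(theta))]

J = lambda t: t[0] ** 2 + t[1] ** 2   # example cost function
theta = [1.0, 1.0]
for _ in range(5):
    theta = step(J, theta)
print(theta)   # both components shrink toward 0
```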
• The value of the step size should not be too big, as it can skip over the minimum point
  and the optimization can fail. It is a hyper-parameter and you need to experiment with
  its values (a small illustration is given below).
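As a small illustration (using the simple cost $J(\theta) = \theta^2$, which also appears in the next example, and learning rates chosen only for demonstration): a modest step shrinks $\theta$ towards the minimum, while a too-large one overshoots and diverges.

```python
# Effect of the learning rate on the update theta := theta - alpha * 2*theta.
def run(alpha, theta=5.0, steps=20):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # gradient of theta**2 is 2*theta
    return theta

print(run(alpha=0.1))   # shrinks toward 0 (update factor 0.8)
print(run(alpha=1.1))   # magnitude keeps growing (update factor -1.2), optimization fails
```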
Minimize $J(\theta) = \theta^2$
Solution:
$\theta := \theta - \alpha\,\dfrac{\partial J(\theta)}{\partial \theta}$
$\theta := \theta - 0.1\,\dfrac{\partial (\theta^2)}{\partial \theta}$
$\theta := \theta - 0.1 \cdot (2\theta)$
$\theta := 0.8\,\theta$
• Table generation:
• Here we start with $\theta = 5$ (and, in the second pair of columns, $\theta = -5$).
  Keep in mind that here the update is $\theta := 0.8\,\theta$, for our learning rate and cost function.

  θ     J(θ)        θ     J(θ)
  5     25         -5     25
  4     16         -4     16
  ⋮      ⋮          ⋮      ⋮
  0      0          0      0

• We can see that, as the number of iterations increases, the cost value goes down and the
  algorithm converges to the optimum value 0.
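The table can be reproduced with a few lines of code (a sketch assuming the same starting value and learning rate; starting from $\theta = -5$ behaves symmetrically):

```python
# Iterate theta := 0.8 * theta from theta = 5 and watch J(theta) = theta**2 decrease.
theta = 5.0
for i in range(10):
    print(i, round(theta, 4), round(theta ** 2, 4))   # iteration, theta, J(theta)
    theta = 0.8 * theta
```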
• Our cost function is $J(\theta) = \theta_1^2 + \theta_2^2$, where $\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}$, and let the learning rate be $\alpha = 0.1$.
• The component-wise updates are
  $\theta_1 := \theta_1 - 0.1\,\dfrac{\partial J(\theta)}{\partial \theta_1}$ and $\theta_2 := \theta_2 - 0.1\,\dfrac{\partial J(\theta)}{\partial \theta_2}$
  θ₁    θ₂    J(θ)
  1     1     2
  ⋮     ⋮     ⋮
  0     0     0
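A sketch reproducing this table by applying the two component-wise updates above (iteration count chosen only for illustration):

```python
# Two-variable gradient descent on J(theta) = theta1**2 + theta2**2 with alpha = 0.1.
theta1, theta2 = 1.0, 1.0
for i in range(25):
    print(i, round(theta1, 4), round(theta2, 4), round(theta1**2 + theta2**2, 4))
    theta1 = theta1 - 0.1 * 2 * theta1   # dJ/dtheta1 = 2*theta1
    theta2 = theta2 - 0.1 * 2 * theta2   # dJ/dtheta2 = 2*theta2
```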