4-Optimization of 2 Variables, Gradient Descent

22-10-2020

Optimization of a function of two variables

• Problem: Minimize $f(x)$, where $x = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$
• Solution:
Find stationary points $x^*$ by solving $\nabla f(x) = 0$.
At each stationary point evaluate the Hessian matrix $H(x^*)$,
which is the matrix containing all second-order partial
derivatives of $f$ w.r.t. $x_1, x_2, \dots, x_n$:

$$
H(x^*) = \begin{pmatrix}
\dfrac{\partial^2 f}{\partial x_1^2}(x^*) & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n}(x^*) \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \partial x_1}(x^*) & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}(x^*)
\end{pmatrix}_{n \times n}
$$


 Draw conclusions based on the following table:

$H(x^*)$            Conclusion
Positive definite   Minimum
Negative definite   Maximum
Indefinite          Saddle point
Semidefinite        No conclusion

 Recall:
A square matrix is said to be
 positive definite if all its eigenvalues are positive (> 0),
 negative definite if all its eigenvalues are negative (< 0),
 positive semidefinite if all its eigenvalues are ≥ 0,
 negative semidefinite if all its eigenvalues are ≤ 0,
 indefinite if it has both positive and negative eigenvalues.
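The classification table and the eigenvalue recall above can be combined into a small classifier; a minimal sketch using NumPy (the helper name `classify_stationary_point` and the tolerance are our own choices, not from the slides):

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point from the eigenvalues of its Hessian H."""
    eig = np.linalg.eigvalsh(H)  # eigvalsh: eigenvalues of a symmetric matrix (all real)
    if np.all(eig > tol):
        return "minimum"         # positive definite
    if np.all(eig < -tol):
        return "maximum"         # negative definite
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"    # indefinite
    return "no conclusion"       # semidefinite: some eigenvalue is (numerically) zero

# Hessian of f(x, y) = x^2 + y^2 at any point: positive definite
print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, 2.0]])))  # minimum
```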


Solve and check for H(x*)

• Example: $f(x, y) = x^3 + y^3 + 3xy$


Solve
• Example: $f(x, y) = x^3 + y^3 + 3xy$

• Hints: $\nabla f = \begin{pmatrix} \partial f/\partial x \\ \partial f/\partial y \end{pmatrix}$ & $H = \begin{pmatrix} \partial^2 f/\partial x^2 & \partial^2 f/\partial x \, \partial y \\ \partial^2 f/\partial y \, \partial x & \partial^2 f/\partial y^2 \end{pmatrix}$

Apply the first derivative test to find stationary points:

$\nabla f = 0 \Rightarrow$ a pair of simultaneous equations in $x$ and $y$.

Solve the simultaneous equations to get the stationary points.

Now, to check for maxima and minima, evaluate the Hessian matrix at these points.


• Example: $f(x, y) = x^3 + y^3 + 3xy$

• Solution: $\nabla f = \begin{pmatrix} 3x^2 + 3y \\ 3y^2 + 3x \end{pmatrix}$ & $H = \begin{pmatrix} 6x & 3 \\ 3 & 6y \end{pmatrix}$

Apply the first derivative test to find stationary points:

$\nabla f = 0 \Rightarrow \begin{pmatrix} 3x^2 + 3y \\ 3y^2 + 3x \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \Rightarrow x^2 = -y \ \& \ y^2 = -x$

Solving simultaneously we get $(0, 0)$ & $(-1, -1)$ as the stationary points.
Now, to check for maxima and minima, we evaluate the Hessian matrix at these points.

$H(0, 0) = \begin{pmatrix} 0 & 3 \\ 3 & 0 \end{pmatrix}$. The eigenvalues of this matrix are $3$ & $-3$ (check), i.e. $H$ is
indefinite, which implies that $(0, 0)$ is a saddle point.

$H(-1, -1) = \begin{pmatrix} -6 & 3 \\ 3 & -6 \end{pmatrix}$. The eigenvalues of this matrix are $-9$ & $-3$ (check), i.e. $H$ is
negative definite, which implies that $(-1, -1)$ is a point of maxima.
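The eigenvalue claims in this example can be verified numerically; a quick check with NumPy (the array names `H_00` and `H_m1` are ours):

```python
import numpy as np

# Hessians of f(x, y) = x^3 + y^3 + 3xy at the two stationary points
H_00 = np.array([[0.0, 3.0], [3.0, 0.0]])    # H(0, 0)
H_m1 = np.array([[-6.0, 3.0], [3.0, -6.0]])  # H(-1, -1)

print(np.linalg.eigvalsh(H_00))  # [-3.  3.] -> indefinite -> saddle point
print(np.linalg.eigvalsh(H_m1))  # [-9. -3.] -> negative definite -> maximum
```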


Application: Least squares

• In machine learning models used for prediction, we often
need to solve systems of the form $Ax = b$.
• This system is often inconsistent, in which
case we have to look for the solution such that the
difference between $Ax$ & $b$ (called the residual) is minimized,
i.e. solve the optimization problem

$\min_x \|Ax - b\|_p$

For $p = 2$, the problem is called ($L_2$) least
squares & the term $\|Ax - b\|_2^2$ is called the residual
sum of squares (RSS).


• Example: Solve $Ax = b$ where

$A = \begin{pmatrix} 2 & 0 \\ -1 & 1 \\ 0 & 2 \end{pmatrix}$ & $b = \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}$

• Solution:
Observe that the system is inconsistent.
We will find the least squares solution.
The objective function is

$\|Ax - b\|_2^2 = (2x - 1)^2 + (-x + y)^2 + (2y + 1)^2$

Finding the minima by the method seen earlier, we get the
solution as $x = \frac{1}{3}$, $y = -\frac{1}{3}$ (check).
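This least-squares solution can be checked against NumPy's built-in solver, which minimizes $\|Ax - b\|_2^2$ directly; a quick verification:

```python
import numpy as np

A = np.array([[2.0, 0.0], [-1.0, 1.0], [0.0, 2.0]])
b = np.array([1.0, 0.0, -1.0])

# lstsq returns the vector minimizing ||Ax - b||^2; here its components are (x, y)
x, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x)  # x is approximately 1/3, y approximately -1/3
```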


Algorithmic approach:
• In practice, computing and storing the full Hessian matrix takes a large
amount of memory, which is infeasible for high-dimensional functions such as
loss functions with large numbers of parameters. For such
situations, first-order algorithmic methods like gradient descent and
second-order methods like Newton's method have been developed.

• General structure of algorithms for unconstrained minimization:

• Choose a starting point $x_0$.
• Beginning at $x_0$, generate a sequence of iterates $\{x_k\}$ with non-
increasing function value $f(x_k)$ until a solution point with sufficient
accuracy is found or until no further progress can be made.
• To generate the next iterate $x_{k+1}$, the algorithm uses information about
the function at $x_k$ and possibly earlier iterates.


Gradient Descent:
• An algorithmic method for finding a local minimum of a
differentiable function.
• The algorithm is initiated by choosing random values for the
parameters.
• The parameters are improved gradually by taking steps proportional
to the negative of the gradient (or approximate gradient) of the
cost function at the current point.
• The process continues until the algorithm converges to a
minimum, i.e. until the difference between successive
iterates becomes stable or reaches a threshold.
 Note: If we instead take steps proportional to the positive of
the gradient, we approach a local maximum of that function;
the procedure is then known as gradient ascent.
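The loop described above can be sketched as a short routine; a minimal one-dimensional version (the function name, stopping rule, and default values are illustrative assumptions, not the lecture's code):

```python
def gradient_descent(grad, x0, gamma=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a differentiable function of one variable given its gradient.

    Takes steps proportional to the negative gradient and stops when
    successive iterates differ by less than `tol`.
    """
    x = x0
    for _ in range(max_iter):
        x_new = x - gamma * grad(x)
        if abs(x_new - x) < tol:  # successive iterates have stabilized
            break
        x = x_new
    return x

# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 5
x_min = gradient_descent(lambda x: 2 * x, x0=5.0)
print(x_min)  # close to 0, the minimizer of x^2
```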


Analogy:
• To get an idea of how Gradient Descent works, let us take an
example.

• Suppose you are at the top of a mountain and want to reach


the base camp which is all the way down at the lowest point
of the mountain. Also, due to the bad weather, the visibility
is really low and you cannot see the path at all. How would
you reach the base camp?


Analogy:

• One way is to use your feet to sense where the land tends to descend. This tells you in which direction the ground slopes downward and where you should take your first step. If you follow the descending path until you encounter a flat area or an ascending path, it is very likely you will reach the base camp.



Limitation:
• If there is a slight rise in the ground while you are going downhill, you would immediately stop, assuming that you have reached the base camp (global minimum), but in reality you are still stuck on the mountain at a local minimum.

• In other words, gradient descent does not guarantee finding the global minimum (or maximum) of the function.

• However, most of the objective functions used in machine learning, such as cost functions, are convex functions, which ensures that a local minimum is also a global minimum.


Mathematical formulation of the idea:

• The algorithm is based on the fact that at any given point $x$ in the domain of the function $f(x)$, the function decreases fastest in the direction of the negative gradient and increases in the opposite direction.

• If one goes from position $a_n$ to position $a_{n+1}$ by going in the direction of the negative gradient with step size/length $\gamma$, i.e. $a_{n+1} = a_n - \gamma \nabla f(a_n)$, then one moves towards the minimum.


Mathematical formulation of the idea:

• It can be shown that if $\gamma$ is small then $f(a_{n+1}) \le f(a_n)$. So if $x_0$ is the starting point of the algorithm, followed by the sequence $x_1, x_2, x_3, \dots$ where $x_{n+1} = x_n - \gamma_n \nabla f(x_n)$, then $f(x_0) \ge f(x_1) \ge f(x_2) \ge \dots$
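The non-increasing property $f(x_0) \ge f(x_1) \ge \dots$ can be checked numerically; a small sketch on $f(x) = x^2$ with $\gamma = 0.1$ (the names `values` and `grad_f` are our own):

```python
def f(x):
    return x * x

def grad_f(x):
    return 2 * x

gamma, x = 0.1, 5.0
values = []
for _ in range(10):
    values.append(f(x))
    x = x - gamma * grad_f(x)  # x_{n+1} = x_n - gamma * grad f(x_n)

# For this step size the sequence f(x_0), f(x_1), ... never increases
print(values[:3])
```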


Algorithm in machine learning

• In machine learning code we normally use the following notation:

• The cost function is denoted by $J(\theta)$, where $\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{pmatrix}$, $\theta_i$ is the $i^{\text{th}}$ parameter, and the learning rate is denoted by $\alpha$,
• so that the iterative formula becomes $\theta := \theta - \alpha \nabla J(\theta)$. If we apply this formula individually to the components of $\theta$, the formula for the $j^{\text{th}}$ component $\theta_j$ becomes

• $\theta_j := \theta_j - \alpha \dfrac{\partial J(\theta)}{\partial \theta_j}$


Note on step size:

• The value of the step size $\gamma$ can be changed at every iteration (hence the notation $\gamma_n$).

• In machine learning the value $\gamma$ is called the learning rate (which can be varied).

• Usually, we take the value of the learning rate to be small, such as 0.1, 0.01, 0.001, etc.

• The value of the step should not be too big, as it can skip over the minimum point and the optimization can then fail. It is a hyperparameter and you need to experiment with its values.
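The danger of too large a step can be illustrated on $f(x) = x^2$: a step of $\gamma = 0.1$ shrinks $|x|$ by a factor of $0.8$ each iteration, while $\gamma = 1.1$ multiplies it by $|1 - 2.2| = 1.2$, so the iteration overshoots the minimum and diverges. A hypothetical sketch:

```python
def step(x, gamma):
    # One gradient-descent step on f(x) = x^2 (gradient 2x)
    return x - gamma * 2 * x

x_small, x_large = 5.0, 5.0
for _ in range(50):
    x_small = step(x_small, 0.1)  # |x| shrinks by a factor of 0.8 each step
    x_large = step(x_large, 1.1)  # |x| grows by a factor of 1.2 each step

print(abs(x_small))  # tiny: converged toward the minimum at 0
print(abs(x_large))  # huge: the iteration diverged
```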



Additional Example: Single variable case

Minimize $f(x) = x^2$

Solution:

We are given a function of one variable. Here the cost function is $J(\theta) = \theta^2$ and there is only one parameter, so that $\theta = [\theta_1]$.

From our cost function $J(\theta)$ we can clearly say that it will be minimum at $\theta = 0$, but it won't be so easy to derive such conclusions while working with more complex functions, so we will apply gradient descent here.

Step 1: Initialize $\theta$ with a random number, say $\theta = 5$, and let the learning rate $\alpha = 0.1$.


Example: Single variable case

Step 2: Simplification of the iteration formula $\theta := \theta - \alpha \dfrac{\partial J(\theta)}{\partial \theta}$:

$\theta := \theta - 0.1 \, \dfrac{\partial (\theta^2)}{\partial \theta}$
$\theta := \theta - 0.1 \cdot (2\theta)$
$\theta := 0.8 \, \theta$


• Table generation:
• Here we are starting with θ = 5 (and, in the second pair of columns, θ = −5). Keep in mind that here θ := 0.8·θ for our learning rate and cost function.

θ        J(θ)        θ        J(θ)
5        25          -5       25
4        16          -4       16
3.2      10.24       -3.2     10.24
2.56     6.55        -2.56    6.55
2.048    4.19        -2.048   4.19
⋮        ⋮           ⋮        ⋮
0        0           0        0

• We can see that as we increase the number of iterations, our cost value goes down and the algorithm converges to the optimum value 0.


Additional Example: Two variable case

• Example 2 (Two variables case):

• Minimize $f(x, y) = x^2 + y^2$
• Solution:

• Our cost function is $J(\theta) = \theta_1^2 + \theta_2^2$ where $\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}$, & let the learning rate $\alpha = 0.1$.

• Update formulas: $\theta_1 := \theta_1 - \alpha \dfrac{\partial J}{\partial \theta_1}$ & $\theta_2 := \theta_2 - \alpha \dfrac{\partial J}{\partial \theta_2}$

• $\theta_1 := \theta_1 - 0.1 \cdot \dfrac{\partial (\theta_1^2 + \theta_2^2)}{\partial \theta_1}$ & $\theta_2 := \theta_2 - 0.1 \cdot \dfrac{\partial (\theta_1^2 + \theta_2^2)}{\partial \theta_2}$

• $\theta_1 := \theta_1 - 0.1 \cdot (2\theta_1)$ & $\theta_2 := \theta_2 - 0.1 \cdot (2\theta_2)$

• $\theta_1 := 0.8 \, \theta_1$ & $\theta_2 := 0.8 \, \theta_2$


• Initialize $\theta_1 = 1$ & $\theta_2 = 1$ and iterate:

θ₁        θ₂        J(θ)
1         1         2
0.8       0.8       1.28
0.64      0.64      0.8192
0.512     0.512     0.5243
0.4096    0.4096    0.3355
⋮         ⋮         ⋮
0         0         0

• We can see that as we increase the number of iterations, our cost value goes down and the algorithm slowly converges to the optimum value (0, 0).
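The two-variable iteration can be reproduced with a short vectorized loop; a sketch using NumPy (variable names are our own) that prints $\theta$ and the cost $J(\theta) = \theta_1^2 + \theta_2^2$ at each step:

```python
import numpy as np

alpha = 0.1
theta = np.array([1.0, 1.0])  # initialize theta_1 = theta_2 = 1

for _ in range(5):
    J = np.sum(theta ** 2)             # cost J(theta) = theta_1^2 + theta_2^2
    print(theta, J)
    theta = theta - alpha * 2 * theta  # gradient step; simplifies to 0.8 * theta
```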
