Optimization For ML
CS771: Introduction To Machine Learning
Nisheeth
Today’s class
• In the last class, we saw that parameter estimation for the linear
regression model is possible in closed form
• This is not always the case for all ML models. What do we do in those
cases?
• We treat the parameter estimation problem as a problem of function
optimization
• There is lots of math, but it’s very intuitive
• Don’t be intimidated
[Figure: a loss function plotted against the ML params, with its minima marked]
Derivatives

The magnitude of the derivative of $f$ at a point is the rate of change of the function at that point:

$$f'(x) = \lim_{\Delta x \to 0} \frac{\Delta f(x)}{\Delta x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

The sign is also important: a positive derivative means $f(x)$ is increasing at $x$ if we increase the value of $x$ by a very small amount; a negative derivative means it is decreasing.

Understanding how $f$ changes its value as we change $x$ is helpful for understanding optimization (minimization/maximization) algorithms.

The derivative becomes zero at stationary points (optima or saddle points). The function becomes "flat" ($\Delta f(x) \approx 0$ if we change $x$ by a very small amount at such points). These are the points where the function has its maxima/minima (unless they are saddle points).
Rules of Derivatives
Some basic rules of taking derivatives
Sum Rule: $\frac{d}{dx}\left[f(x) + g(x)\right] = f'(x) + g'(x)$
Scaling Rule: $\frac{d}{dx}\left[a\, f(x)\right] = a\, f'(x)$ if $a$ is not a function of $x$
Product Rule: $\frac{d}{dx}\left[f(x)\, g(x)\right] = f'(x)\, g(x) + f(x)\, g'(x)$
Quotient Rule: $\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)\, g(x) - f(x)\, g'(x)}{g(x)^2}$
Chain Rule: $\frac{d}{dx}\, f(g(x)) = f'(g(x))\, g'(x)$
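These rules can be sanity-checked symbolically. The short sketch below (an added illustration, not part of the original slides) uses SymPy to verify the product and chain rules on a toy pair of functions.

```python
# Quick sanity check of the product and chain rules using SymPy
# (illustrative sketch only; the functions f and g are toy examples).
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)
g = x**2 + 1

# Product rule: (f*g)' == f'*g + f*g'
lhs = sp.diff(f * g, x)
rhs = sp.diff(f, x) * g + f * sp.diff(g, x)
print(sp.simplify(lhs - rhs))  # prints 0

# Chain rule: d/dx f(g(x)) == f'(g(x)) * g'(x)
lhs = sp.diff(f.subs(x, g), x)
rhs = sp.diff(f, x).subs(x, g) * sp.diff(g, x)
print(sp.simplify(lhs - rhs))  # prints 0
```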
Derivatives

How the derivative itself changes tells us about the function’s optima:

$f'(x) = 0$ at $x$, with $f'(x) > 0$ just before $x$ and $f'(x) < 0$ just after $x$: $x$ is a maxima
$f'(x) = 0$ at $x$, with $f'(x) < 0$ just before $x$ and $f'(x) > 0$ just after $x$: $x$ is a minima
$f'(x) = 0$ at $x$, and $f'(x) = 0$ just before and just after $x$: $x$ may be a saddle; may need higher derivatives to decide

A saddle is a point of inflection where the derivative is also zero. Saddle points are very common for loss functions of deep learning models and need to be handled carefully during optimization.
The Gradient

For a multivariate scalar-valued function $f(\boldsymbol{x})$ with $\boldsymbol{x} = [x_1, x_2, \ldots, x_D]$, the gradient collects all the partial derivatives:

$$\nabla f(\boldsymbol{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_D} \right]$$

Each element in this gradient vector tells us how much $f$ will change if we move a little along the corresponding $x_i$ (akin to the one-dimensional case).
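This "move a little along each coordinate" view can be made concrete with finite differences. The sketch below is an added illustration, not from the slides; the helper `numerical_gradient` and the toy function are made up for the example.

```python
# A minimal sketch (not from the slides): approximating the gradient of a
# multivariate function by a central finite difference along each coordinate.
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Estimate the gradient of f at x, one coordinate at a time."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

# Toy example: f(x) = x1^2 + 3*x2, whose exact gradient is [2*x1, 3]
f = lambda x: x[0]**2 + 3 * x[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))  # approx. [2., 3.]
```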
The Hessian

For a multivariate scalar-valued function $f(\boldsymbol{x})$, the Hessian is a $D \times D$ matrix of second-order partial derivatives:

$$\nabla^2 f(\boldsymbol{x}) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_D} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_D} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_D \partial x_1} & \frac{\partial^2 f}{\partial x_D \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_D^2}
\end{bmatrix}$$

Note: if the function itself is vector-valued, then we will have one such Hessian matrix for each output dimension of the function.

The Hessian gives information about the curvature of the function at the point $\boldsymbol{x}$.

A square, symmetric matrix $\boldsymbol{M}$ is positive semi-definite (PSD) if $\boldsymbol{z}^\top \boldsymbol{M} \boldsymbol{z} \geq 0$ for all $\boldsymbol{z}$, which holds if all its eigenvalues are non-negative. It will be negative semi-definite (NSD) if all its eigenvalues are non-positive.

The Hessian matrix can be used to assess the optima/saddle points:
If $\nabla f(\boldsymbol{x}) = 0$ and $\nabla^2 f(\boldsymbol{x})$ is a positive semi-definite (PSD) matrix, then $\boldsymbol{x}$ is a minima.
If $\nabla f(\boldsymbol{x}) = 0$ and $\nabla^2 f(\boldsymbol{x})$ is a negative semi-definite (NSD) matrix, then $\boldsymbol{x}$ is a maxima.
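The eigenvalue test above is easy to run numerically. The sketch below is an added illustration, not from the slides; the toy Hessians come from two hypothetical functions with known curvature.

```python
# A small sketch (not from the slides): classifying a stationary point from
# the eigenvalues of its Hessian matrix.
import numpy as np

def classify(hessian):
    """Classify a stationary point from the eigenvalues of its Hessian."""
    eig = np.linalg.eigvalsh(hessian)   # Hessian is symmetric
    if np.all(eig >= 0):
        return "minimum (Hessian is PSD)"
    if np.all(eig <= 0):
        return "maximum (Hessian is NSD)"
    return "saddle point (mixed-sign eigenvalues)"

# f(x1, x2) = x1^2 + x2^2 has Hessian [[2, 0], [0, 2]] -> minimum at (0, 0)
print(classify(np.array([[2.0, 0.0], [0.0, 2.0]])))

# f(x1, x2) = x1^2 - x2^2 has Hessian [[2, 0], [0, -2]] -> saddle at (0, 0)
print(classify(np.array([[2.0, 0.0], [0.0, -2.0]])))
```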
Convex and Non-Convex Functions
A function being optimized can be either convex or non-convex
[Figure: a couple of examples of convex functions]
Convex functions are bowl-shaped. They have a unique optima (minima): any local minimum is also the global minimum.
Convex Sets
A set $S$ of points is a convex set if, for any two points $\boldsymbol{x}, \boldsymbol{y} \in S$ and any $0 \leq \alpha \leq 1$,

$$\boldsymbol{z} = \alpha \boldsymbol{x} + (1 - \alpha)\, \boldsymbol{y} \in S$$

$\boldsymbol{z}$ is also called a "convex combination" of the two points. A convex combination of $N$ points $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N$ can be defined likewise as $\sum_{n=1}^{N} \alpha_n \boldsymbol{x}_n$ with $\alpha_n \geq 0$ and $\sum_{n=1}^{N} \alpha_n = 1$.

The above means that all points on the line segment between $\boldsymbol{x}$ and $\boldsymbol{y}$ lie within $S$.
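Convex functions and convex combinations are tied together by the inequality $f(\alpha \boldsymbol{x} + (1-\alpha)\boldsymbol{y}) \leq \alpha f(\boldsymbol{x}) + (1-\alpha) f(\boldsymbol{y})$, which is what gives convex functions their bowl shape. Below is a rough numerical sanity check of this inequality on randomly sampled points (an added sketch, not from the slides; the helper `seems_convex` is made up for illustration).

```python
# A rough numerical check (not from the slides) of the convexity inequality
# f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) on randomly sampled points.
import numpy as np

def seems_convex(f, n_trials=10000, low=-5.0, high=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, n_trials)
    y = rng.uniform(low, high, n_trials)
    a = rng.uniform(0.0, 1.0, n_trials)
    lhs = f(a * x + (1 - a) * y)
    rhs = a * f(x) + (1 - a) * f(y)
    return bool(np.all(lhs <= rhs + 1e-9))

print(seems_convex(lambda x: x**2))  # True: a bowl-shaped (convex) function
print(seems_convex(np.sin))          # False: sin is not convex on [-5, 5]
```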
Optimization Using First-Order Optimality
Very simple. We already used this approach for linear and ridge regression.

It is called "first order" since only the gradient is used, and the gradient provides the first-order information about the function being optimized. The approach works only for very simple problems where the objective is convex and there are no constraints on the values $\boldsymbol{w}$ can take.

First-order optimality: the gradient must be equal to zero at the optima,

$$\nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = 0$$

Sometimes, setting $\nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = 0$ and solving for $\boldsymbol{w}$ gives a closed-form solution (see the sketch below). If a closed-form solution is not available, the gradient vector can still be used in iterative optimization algorithms, like gradient descent.
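As a concrete instance of the closed-form route, here is a small NumPy sketch (an added example, not from the slides; the synthetic data and variable names are made up). Setting the gradient of the least squares loss $\lVert \boldsymbol{y} - \boldsymbol{X}\boldsymbol{w} \rVert^2$ to zero gives the normal equations $\boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{w} = \boldsymbol{X}^\top \boldsymbol{y}$, which are solved directly.

```python
# A small sketch (not from the slides): first-order optimality for least
# squares linear regression on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed-form solution from the first-order optimality condition
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # should be close to w_true
```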
Optimization via Gradient Descent

Gradient descent (GD) is iterative: it requires several steps/iterations to find the optimal solution. Fact: the gradient gives the direction of steepest change in the function’s value, so GD will move in the direction opposite to the gradient. (Can this approach be used for maximization problems? Yes: for maximization problems we can use gradient ascent, which moves in the direction of the gradient.) For convex functions, GD will converge to the global minima; for non-convex functions, good initialization is needed.

Gradient Descent
1. Initialize $\boldsymbol{w}$ as $\boldsymbol{w}^{(0)}$
2. For iteration $t = 0, 1, 2, \ldots$ (or until convergence):
   - Calculate the gradient $\boldsymbol{g}^{(t)}$ using the current iterate $\boldsymbol{w}^{(t)}$
   - Set the learning rate $\eta_t$
   - Move in the opposite direction of the gradient:
     $$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t\, \boldsymbol{g}^{(t)}$$

The learning rate $\eta_t$ is very important and should be set carefully (fixed or chosen adaptively); we will discuss some strategies later. The justification for moving against the gradient will be seen shortly. Sometimes it may be tricky to assess convergence; we will see some methods for that later as well. A minimal sketch of this loop follows.
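The sketch below is an added illustration, not from the slides; the function names, the fixed learning rate, and the toy objective are made up. It follows the loop above and uses the gradient norm as one possible convergence check.

```python
# A minimal sketch (not from the slides) of the gradient descent loop above,
# with a fixed learning rate and a simple gradient-norm convergence check.
import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, max_iters=1000, tol=1e-6):
    """Minimize a function given its gradient grad_fn, starting from w0."""
    w = np.asarray(w0, dtype=float)
    for t in range(max_iters):
        g = grad_fn(w)                 # gradient at the current iterate
        if np.linalg.norm(g) < tol:    # one possible convergence criterion
            break
        w = w - lr * g                 # move opposite to the gradient
    return w

# Toy example: minimize f(w) = (w1 - 3)^2 + (w2 + 1)^2, gradient 2*(w - [3, -1])
grad = lambda w: 2 * (w - np.array([3.0, -1.0]))
print(gradient_descent(grad, w0=[0.0, 0.0]))  # approx. [3., -1.]
```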
Gradient Descent: An Illustration

[Figure: gradient descent on a one-dimensional loss $L(\boldsymbol{w})$. Where the gradient is negative, the update moves $\boldsymbol{w}$ in the positive direction; where the gradient is positive, the update moves in the negative direction. The iterates $\boldsymbol{w}^{(0)}, \boldsymbol{w}^{(1)}, \boldsymbol{w}^{(2)}, \boldsymbol{w}^{(3)}, \ldots$ approach the optimum $\boldsymbol{w}^*$. The learning rate is very important. On a non-convex loss, GD can get stuck at a local minima, so good initialization is very important.]
GD: An Example
Let’s apply GD to least squares linear regression, with loss

$$L(\boldsymbol{w}) = \sum_{n=1}^{N} \left( y_n - \boldsymbol{w}^\top \boldsymbol{x}_n \right)^2$$

The gradient:

$$\boldsymbol{g} = \nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = -2 \sum_{n=1}^{N} \left( y_n - \boldsymbol{w}^\top \boldsymbol{x}_n \right) \boldsymbol{x}_n$$

Each GD update will be of the form (absorbing the constant factor into the learning rate)

$$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} + \eta_t \sum_{n=1}^{N} \left( y_n - \boldsymbol{w}^{(t)\top} \boldsymbol{x}_n \right) \boldsymbol{x}_n$$

Here $\left( y_n - \boldsymbol{w}^{(t)\top} \boldsymbol{x}_n \right)$ is the prediction error of the current model on the $n$-th training example: training examples on which the current model’s error is large contribute more to the update.
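Putting the pieces together, here is a small NumPy sketch of this update on synthetic data (an added example, not from the slides; the data, learning rate, and iteration count are made up for illustration).

```python
# A small sketch (not from the slides): gradient descent for least squares
# linear regression, using the error-weighted update rule above.
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(D)          # initialization w^(0)
lr = 0.001               # fixed learning rate (could also be chosen adaptively)
for t in range(500):
    errors = y - X @ w           # prediction errors of the current model
    w = w + lr * (X.T @ errors)  # examples with large error contribute more

print(w)  # should be close to w_true
```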