
Optimization for ML

CS771: Introduction to Machine Learning


Nisheeth
Today’s class
• In the last class, we saw that parameter estimation for the linear regression model is possible in closed form
• This is not always the case for all ML models. What do we do in those cases?
• We treat the parameter estimation problem as a problem of function optimization
• There is lots of math, but it’s very intuitive. Don’t be intimidated

[Figure: a loss function plotted against the ML params, with the minima marked]

Nice reference for today’s material: for those of you interested in a deeper dive into the math, see Ch 3 in this book.
Functions and their optima
 Many ML problems require us to optimize a function 𝑓(𝑥) of some variable(s) 𝑥
  (here 𝑓 is the objective function of the ML problem we are solving, e.g., squared loss for regression; assume 𝑥 is unconstrained for now, i.e., just a real-valued number/vector)
 For simplicity, assume 𝑓 is a scalar-valued function of a scalar 𝑥
[Figure: a curve 𝑓(𝑥) with several local maxima, local minima, a global maxima and a global minima marked. Usually we are interested in global optima but often want to find local optima, too. For deep learning models, often the local optima are what we can find (and they usually suffice) – more later]
 Any function has one/more optima (maxima, minima), and maybe saddle points (will see what these are later)
 Finding the optima or saddles requires derivatives/gradients of the function
Derivatives
 Magnitude of the derivative at a point is the rate of change of the function at that point (will sometimes use 𝑓′(𝑥) to denote the derivative d𝑓/d𝑥)

    𝑓′(𝑥) = lim_{∆𝑥 → 0} ∆𝑓(𝑥)/∆𝑥 = lim_{∆𝑥 → 0} [𝑓(𝑥 + ∆𝑥) − 𝑓(𝑥)] / ∆𝑥

 Sign is also important: a positive derivative means 𝑓 is increasing at 𝑥 if we increase the value of 𝑥 by a very small amount; a negative derivative means it is decreasing
[Figure: a curve 𝑓(𝑥) with the changes ∆𝑓(𝑥) and ∆𝑥 marked at two points]
 Understanding how 𝑓 changes its value as we change 𝑥 is helpful for understanding optimization (minimization/maximization) algorithms
 Derivative becomes zero at stationary points (optima or saddle points)
 The function becomes “flat” (∆𝑓(𝑥) ≈ 0 if we change 𝑥 by a very little at such points)
 These are the points where the function has its maxima/minima (unless they are saddles); a small numerical sketch follows
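Not part of the original slides: a minimal Python sketch of the definition above, approximating 𝑓′(𝑥) by ∆𝑓(𝑥)/∆𝑥 for a small ∆𝑥. The choice 𝑓(𝑥) = 𝑥² is an arbitrary example.

```python
# Finite-difference approximation of the derivative: f'(x) ~ (f(x + dx) - f(x)) / dx.
# Example function f(x) = x^2, whose exact derivative is 2x.

def f(x):
    return x ** 2

def numerical_derivative(f, x, dx=1e-6):
    """Approximate f'(x) using a small forward difference."""
    return (f(x + dx) - f(x)) / dx

for x in [-2.0, 0.0, 3.0]:
    approx = numerical_derivative(f, x)
    print(f"x = {x:+.1f}  approx f'(x) = {approx:+.6f}  exact = {2 * x:+.6f}")
# At x = 0 the derivative is (approximately) zero: a stationary point (here, the minimum).
```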
Rules of Derivatives
Some basic rules of taking derivatives
 Sum Rule: (𝑓(𝑥) + 𝑔(𝑥))′ = 𝑓′(𝑥) + 𝑔′(𝑥)
 Scaling Rule: (𝑎 · 𝑓(𝑥))′ = 𝑎 · 𝑓′(𝑥) if 𝑎 is not a function of 𝑥
 Product Rule: (𝑓(𝑥) 𝑔(𝑥))′ = 𝑓′(𝑥) 𝑔(𝑥) + 𝑓(𝑥) 𝑔′(𝑥)
 Quotient Rule: (𝑓(𝑥)/𝑔(𝑥))′ = [𝑓′(𝑥) 𝑔(𝑥) − 𝑓(𝑥) 𝑔′(𝑥)] / 𝑔(𝑥)²
 Chain Rule: (𝑓(𝑔(𝑥)))′ = 𝑓′(𝑔(𝑥)) · 𝑔′(𝑥)

We already used some of these (sum, scaling and chain) when calculating the derivative for the linear regression model (a quick numerical check of two of the rules follows)
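My addition, not from the slides: a quick numerical sanity check of the product and chain rules against a finite-difference estimate, using arbitrary example functions.

```python
import math

def numerical_derivative(f, x, dx=1e-6):
    return (f(x + dx) - f(x)) / dx

x = 1.3

# Product rule: d/dx [x^2 * sin(x)] = 2x*sin(x) + x^2*cos(x)
h = lambda t: t ** 2 * math.sin(t)
product_rule = 2 * x * math.sin(x) + x ** 2 * math.cos(x)

# Chain rule: d/dx [sin(x^2)] = cos(x^2) * 2x
g = lambda t: math.sin(t ** 2)
chain_rule = math.cos(x ** 2) * 2 * x

print(numerical_derivative(h, x), product_rule)   # the two values should be close
print(numerical_derivative(g, x), chain_rule)     # the two values should be close
```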
Derivatives
 How the derivative itself changes tells us about the function’s optima
 • 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ > 0 just before 𝑥, 𝑓′ < 0 just after 𝑥: 𝑥 is a maxima
 • 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ < 0 just before 𝑥, 𝑓′ > 0 just after 𝑥: 𝑥 is a minima
 • 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ = 0 just before and just after 𝑥: 𝑥 may be a saddle; may need higher derivatives to tell
 The second derivative can provide this information: 𝑓′(𝑥) = 0 and 𝑓″(𝑥) < 0 means 𝑥 is a maxima; 𝑓′(𝑥) = 0 and 𝑓″(𝑥) > 0 means 𝑥 is a minima; 𝑓′(𝑥) = 0 and 𝑓″(𝑥) = 0 means 𝑥 may be a saddle
Saddle Points
 Points where the derivative is zero but which are neither minima nor maxima
[Figure: a curve with a saddle point marked; a saddle is a point of inflection where the derivative is also zero]
 Saddle points are very common for loss functions of deep learning models
 Need to be handled carefully during optimization
 Second or higher derivatives may help identify if a stationary point is a saddle (a small numerical illustration follows)
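My addition (not from the slides): checking the second derivative numerically at a stationary point. 𝑓(𝑥) = 𝑥² has a minimum at 0, while 𝑓(𝑥) = 𝑥³ has 𝑓′(0) = 0 and 𝑓″(0) = 0, the classic one-dimensional saddle/inflection case where higher derivatives are needed.

```python
def second_derivative(f, x, dx=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + dx) - 2 * f(x) + f(x - dx)) / dx ** 2

square = lambda x: x ** 2    # stationary point at 0 is a minimum
cube = lambda x: x ** 3      # stationary point at 0 is a saddle (inflection)

print(second_derivative(square, 0.0))  # ~ 2  > 0  -> minimum
print(second_derivative(cube, 0.0))    # ~ 0       -> need higher derivatives (saddle here)
```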
Multivariate Functions
 Most functions that we see in ML are multivariate functions
 Example: the loss function in linear regression was a multivariate function of the 𝐷-dim weight vector
 Here is an illustration of a function of 2 variables (4 maxima and 5 minima)
[Figure: surface plot of the function and its two-dim contour plot (i.e., what it looks like from above). Plot courtesy: http://benchmarkfcns.xyz/benchmarkfcns/griewankfcn.html]
Derivatives of Multivariate Functions
 Can define the derivative for multivariate functions as well, via the gradient
 The gradient of a function 𝑓(𝒙), 𝒙 = (𝑥₁, …, 𝑥_𝐷), is the vector of partial derivatives

    ∇𝑓(𝒙) = ( ∂𝑓/∂𝑥₁, ∂𝑓/∂𝑥₂, …, ∂𝑓/∂𝑥_𝐷 )

  Each element in this gradient vector tells us how much 𝑓 will change if we move a little along the corresponding coordinate (akin to the one-dim case); a small numerical sketch follows
 Optima and saddle points are defined similarly to the one-dim case
 The required properties that we saw for the one-dim case must be satisfied along all the directions
 The second derivative in this case is known as the Hessian
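Not from the slides: a small sketch of the gradient as a vector of partial derivatives, estimated by perturbing one coordinate at a time. The two-variable test function is an arbitrary choice; its exact gradient is (2𝑥₁ + 3𝑥₂, 3𝑥₁).

```python
import numpy as np

def f(x):
    # Example 2-variable function: f(x) = x1^2 + 3*x1*x2
    return x[0] ** 2 + 3 * x[0] * x[1]

def numerical_gradient(f, x, dx=1e-6):
    """Vector of partial derivatives: perturb one coordinate at a time."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = dx
        grad[i] = (f(x + e) - f(x)) / dx
    return grad

x = np.array([1.0, 2.0])
print(numerical_gradient(f, x))   # approx [2*1 + 3*2, 3*1] = [8, 3]
```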
The Hessian
 For a multivariate scalar-valued function 𝑓(𝒙) of a 𝐷-dim vector 𝒙, the Hessian is the 𝐷×𝐷 matrix of second partial derivatives

    ∇²𝑓(𝒙) = [ ∂²𝑓/∂𝑥₁²        ∂²𝑓/∂𝑥₁∂𝑥₂    …   ∂²𝑓/∂𝑥₁∂𝑥_𝐷
               ∂²𝑓/∂𝑥₂∂𝑥₁     ∂²𝑓/∂𝑥₂²       …   ∂²𝑓/∂𝑥₂∂𝑥_𝐷
               ⋮               ⋮              ⋱   ⋮
               ∂²𝑓/∂𝑥_𝐷∂𝑥₁    ∂²𝑓/∂𝑥_𝐷∂𝑥₂    …   ∂²𝑓/∂𝑥_𝐷²    ]

  The Hessian gives information about the curvature of the function at the point 𝒙
  Note: if the function itself is vector valued, then we will have one such Hessian matrix for each output dimension of the function
  A square, symmetric matrix 𝐌 is PSD if 𝒛ᵀ𝐌𝒛 ≥ 0 for all vectors 𝒛, equivalently if all its eigenvalues are non-negative; it will be NSD if 𝒛ᵀ𝐌𝒛 ≤ 0 for all 𝒛
 The Hessian matrix can be used to assess the optima/saddle points (a numerical sketch follows)
 If ∇𝑓(𝒙) = 0 and ∇²𝑓(𝒙) is a positive semi-definite (PSD) matrix, then 𝒙 is a minima
 If ∇𝑓(𝒙) = 0 and ∇²𝑓(𝒙) is a negative semi-definite (NSD) matrix, then 𝒙 is a maxima
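My addition: a sketch of estimating the Hessian by finite differences at a stationary point and applying the eigenvalue test above. The toy function 𝑓(𝒙) = 𝑥₁² + 𝑥₂² is an assumed example with its minimum at the origin.

```python
import numpy as np

def f(x):
    # Toy function with a unique minimum at (0, 0)
    return x[0] ** 2 + x[1] ** 2

def numerical_hessian(f, x, dx=1e-4):
    """D x D matrix of second partial derivatives via (forward) finite differences."""
    D = len(x)
    H = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            ei, ej = np.zeros(D), np.zeros(D)
            ei[i], ej[j] = dx, dx
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / dx ** 2
    return H

H = numerical_hessian(f, np.array([0.0, 0.0]))
eigvals = np.linalg.eigvalsh(H)           # Hessian is symmetric, so eigvalsh applies
print(H)                                  # approx [[2, 0], [0, 2]]
print(eigvals, (eigvals >= -1e-6).all())  # all eigenvalues >= 0 -> PSD -> minimum
```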
Convex and Non-Convex Functions
 A function being optimized can be either convex or non-convex
 Here are a couple of examples of convex functions
[Figure: two bowl-shaped convex functions. Convex functions are bowl-shaped; they have a unique optima (minima). The negative of a convex function is called a concave function, which also has a unique optima (maxima)]
 Here are a couple of examples of non-convex functions
[Figure: two non-convex functions. Non-convex functions have multiple minima and are usually harder to optimize as compared to convex functions. Loss functions of most deep learning models are non-convex]
Convex Sets
 A set 𝑆 of points is a convex set if, for any two points 𝑥, 𝑦 ∈ 𝑆 and any 0 ≤ 𝛼 ≤ 1,

    𝑧 = 𝛼𝑥 + (1 − 𝛼)𝑦 ∈ 𝑆

  (𝑧 is also called a “convex combination” of the two points; can also define a convex combination of 𝑁 points as ∑ₙ 𝛼ₙ𝑥ₙ with 𝛼ₙ ≥ 0 and ∑ₙ 𝛼ₙ = 1)
 The above means that all points on the line segment between 𝑥 and 𝑦 lie within 𝑆 (a tiny numerical illustration follows)
 The domain of a convex function needs to be a convex set
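Not from the slides: a tiny check of the definition on an assumed example set, the unit disc 𝑆 = {𝒙 : ‖𝒙‖ ≤ 1}; every convex combination of two member points stays in 𝑆.

```python
import numpy as np

def in_unit_disc(p):
    # Membership test for the convex set S = {x : ||x|| <= 1}
    return np.linalg.norm(p) <= 1.0 + 1e-12

x = np.array([0.8, 0.0])
y = np.array([-0.3, 0.9])
assert in_unit_disc(x) and in_unit_disc(y)

# Every point z = alpha*x + (1 - alpha)*y on the segment between x and y lies in S
for alpha in np.linspace(0.0, 1.0, 11):
    z = alpha * x + (1 - alpha) * y
    assert in_unit_disc(z)
print("all convex combinations stayed inside the unit disc")
```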
Convex Functions
 Informally, 𝑓 is convex if all of its chords lie above the function everywhere
 Formally (assuming a differentiable function), some tests for convexity:
 • First-order convexity (the graph of 𝑓 must lie above all the tangents): 𝑓(𝑦) ≥ 𝑓(𝑥) + ∇𝑓(𝑥)ᵀ(𝑦 − 𝑥) for all 𝑥, 𝑦
 • Second-order convexity: the Hessian ∇²𝑓(𝑥) must be positive semi-definite everywhere
 Exercise: Show that the ridge regression objective is convex (a numerical, non-rigorous check is sketched below)
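For the exercise above, a numerical (not a formal) sketch, my addition, assuming the ridge objective is 𝐿(𝒘) = ‖𝒚 − 𝐗𝒘‖² + 𝜆‖𝒘‖², whose Hessian 2(𝐗ᵀ𝐗 + 𝜆𝐈) does not depend on 𝒘 and should be PSD.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 20, 5, 0.1
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Ridge objective L(w) = ||y - Xw||^2 + lam * ||w||^2
# Its Hessian (independent of w) is 2 * (X^T X + lam * I)
H = 2 * (X.T @ X + lam * np.eye(D))

eigvals = np.linalg.eigvalsh(H)
print(eigvals)                  # all strictly positive here since lam > 0
print((eigvals >= 0).all())     # PSD Hessian everywhere => the objective is convex
```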
Optimization Using First-Order Optimality
 Very simple. Already used this approach for linear and ridge regression
  (Called “first order” since only the gradient is used, and the gradient provides the first-order info about the function being optimized. The approach works only for very simple problems where the objective is convex and there are no constraints on the values the variables can take)
 First-order optimality: the gradient must be equal to zero at the optima

    ∇𝑓(𝒘) = 0

 Sometimes, setting the gradient to zero and solving for 𝒘 gives a closed-form solution (a small sketch follows)
 If a closed-form solution is not available, the gradient vector can still be used in iterative optimization algos, like gradient descent
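A small sketch (my addition) for ridge regression: setting ∇𝐿(𝒘) = 2𝐗ᵀ(𝐗𝒘 − 𝒚) + 2𝜆𝒘 = 0 gives the closed form 𝒘 = (𝐗ᵀ𝐗 + 𝜆𝐈)⁻¹𝐗ᵀ𝒚; the code verifies the gradient vanishes there. The objective form is the same assumption as in the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 4, 0.5
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Setting the gradient of ||y - Xw||^2 + lam*||w||^2 to zero and solving for w:
w_star = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# The gradient at w_star should be (numerically) zero
grad = 2 * X.T @ (X @ w_star - y) + 2 * lam * w_star
print(np.linalg.norm(grad))   # ~ 0 up to floating point error
```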
Optimization via Gradient Descent
 Gradient descent (GD) is iterative, since it requires several steps/iterations to find the optimal solution
 Fact: the gradient gives the direction of steepest change in the function’s value; GD moves in the direction opposite to the gradient (to solve maximization problems, we can instead use gradient ascent, which moves in the direction of the gradient)
 For convex functions, GD will converge to the global minima; good initialization is needed for non-convex functions
 The learning rate is very important and should be set carefully (fixed or chosen adaptively); will discuss some strategies later
 Sometimes it may be tricky to assess convergence; will see some methods later

Gradient Descent (a minimal code sketch follows this outline)
 Initialize 𝒘 as 𝒘^(0)
 For iteration 𝑡 = 0, 1, 2, … (or until convergence)
  • Calculate the gradient 𝒈^(𝑡) using the current iterate 𝒘^(𝑡)
  • Set the learning rate 𝜂_𝑡 (will see the justification shortly)
  • Move in the opposite direction of the gradient: 𝒘^(𝑡+1) = 𝒘^(𝑡) − 𝜂_𝑡 𝒈^(𝑡)
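A minimal sketch of the loop above (my addition): the gradient function, learning rate, and iteration count are placeholders you would set per problem; the quadratic example is arbitrary.

```python
import numpy as np

def gradient_descent(grad_fn, w0, eta=0.1, num_iters=100):
    """Iterate w_{t+1} = w_t - eta * g_t with a fixed learning rate eta."""
    w = np.array(w0, dtype=float)
    for t in range(num_iters):
        g = grad_fn(w)            # gradient at the current iterate
        w = w - eta * g           # move opposite to the gradient
    return w

# Example: minimize f(w) = (w1 - 3)^2 + (w2 + 1)^2, whose gradient is 2*(w - [3, -1])
grad_fn = lambda w: 2 * (w - np.array([3.0, -1.0]))
print(gradient_descent(grad_fn, w0=[0.0, 0.0]))   # approaches [3, -1]
```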
Gradient Descent: An Illustration
[Figure: a loss function 𝐿(𝒘) with GD iterates 𝒘^(0), 𝒘^(1), 𝒘^(2), 𝒘^(3), … converging towards 𝒘∗. Where the gradient is negative we move in the positive direction; where the gradient is positive we move in the negative direction. The learning rate is very important. A second panel shows a run that gets stuck at a local minima: good initialization is very important]
GD: An Example
 Let’s apply GD to least squares linear regression, whose objective is

    𝐿(𝒘) = ∑ₙ (𝑦ₙ − 𝒘ᵀ𝒙ₙ)²

 The gradient: 𝒈 = −2 ∑ₙ (𝑦ₙ − 𝒘ᵀ𝒙ₙ) 𝒙ₙ, where (𝑦ₙ − 𝒘ᵀ𝒙ₙ) is the prediction error of the current model on the 𝑛-th training example
 Each GD update will be of the form 𝒘^(𝑡+1) = 𝒘^(𝑡) + 2𝜂_𝑡 ∑ₙ (𝑦ₙ − 𝒘^(𝑡)ᵀ𝒙ₙ) 𝒙ₙ, so training examples on which the current model’s error is large contribute more to the update (a code sketch of this update follows)
 Exercise: Assume a single training example and a sufficiently small learning rate, and show that the GD update improves the prediction on the training input (𝒙ₙ, 𝑦ₙ), i.e., 𝒘^(𝑡+1)ᵀ𝒙ₙ is closer to 𝑦ₙ than 𝒘^(𝑡)ᵀ𝒙ₙ is
 This is sort of a proof that GD updates are “corrective” in nature (and it is actually true not just for linear regression but can also be shown for various other ML models)
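A sketch of this update on synthetic data (my addition). It uses the averaged (mean) squared error so that a fixed learning rate works out of the box; the slides use the sum form, which only rescales the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=N)

w = np.zeros(D)
eta = 0.1
for t in range(500):
    err = y - X @ w                  # per-example prediction errors of the current model
    g = -(2.0 / N) * X.T @ err       # gradient of the mean squared error
    w = w - eta * g                  # examples with larger error contribute more to the update

print(w)   # close to w_true = [1.0, -2.0, 0.5]
```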
Coming up next
 Gradients when the function is non-differentiable
 Solving optimization problems
 Iterative optimization algorithms, such as gradient descent and its variants

