Optimization For ML
CS771: Introduction To Machine Learning
Nisheeth
Today’s class
• In the last class, we saw that parameter estimation for the linear
regression model is possible in closed form
• This is not always the case for all ML models. What do we do in those
cases?
• We treat the parameter estimation problem as a problem of function
optimization
• There is lots of math, but it’s very intuitive
• Don’t be intimidated
[Figure: a loss function plotted against the ML params, with its minima marked]
Derivatives

The magnitude of the derivative of $f$ at a point is the rate of change of the function at that point:

$$f'(x) = \lim_{\Delta x \to 0} \frac{\Delta f(x)}{\Delta x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

The sign is also important: a positive derivative means $f(x)$ is increasing at $x$ if we increase the value of $x$ by a very small amount; a negative derivative means it is decreasing.

Understanding how $f$ changes its value as we change $x$ is helpful for understanding optimization (minimization/maximization) algorithms.

The derivative becomes zero at stationary points (optima or saddle points). The function becomes "flat" ($\Delta f(x) \approx 0$ if we change $x$ by a very small amount at such points). These are the points where the function has its maxima/minima (unless they are saddle points).
Rules of Derivatives
Some basic rules of taking derivatives
Sum Rule: $\frac{d}{dx}\left[f(x) + g(x)\right] = f'(x) + g'(x)$
Scaling Rule: $\frac{d}{dx}\left[a\, f(x)\right] = a\, f'(x)$ if $a$ is not a function of $x$
Product Rule: $\frac{d}{dx}\left[f(x)\, g(x)\right] = f'(x)\, g(x) + f(x)\, g'(x)$
Quotient Rule: $\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)\, g(x) - f(x)\, g'(x)}{g(x)^2}$
Chain Rule: $\frac{d}{dx}\, f(g(x)) = f'(g(x))\, g'(x)$
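These rules can be sanity-checked symbolically. The short sketch below (an added illustration, not part of the original slides) uses SymPy to verify the product and chain rules on a toy pair of functions.

```python
# Quick sanity check of the product and chain rules using SymPy
# (illustrative sketch only; the functions f and g are toy examples).
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)
g = x**2 + 1

# Product rule: (f*g)' == f'*g + f*g'
lhs = sp.diff(f * g, x)
rhs = sp.diff(f, x) * g + f * sp.diff(g, x)
print(sp.simplify(lhs - rhs))  # prints 0

# Chain rule: d/dx f(g(x)) == f'(g(x)) * g'(x)
lhs = sp.diff(f.subs(x, g), x)
rhs = sp.diff(f, x).subs(x, g) * sp.diff(g, x)
print(sp.simplify(lhs - rhs))  # prints 0
```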
Derivatives

How the derivative itself changes tells us about the function’s optima:

$f'(x) = 0$ at $x$, with $f'(x) > 0$ just before $x$ and $f'(x) < 0$ just after $x$: $x$ is a maxima
$f'(x) = 0$ at $x$, with $f'(x) < 0$ just before $x$ and $f'(x) > 0$ just after $x$: $x$ is a minima
$f'(x) = 0$ at $x$, and $f'(x) = 0$ just before and just after $x$: $x$ may be a saddle; may need higher derivatives to decide

A saddle is a point of inflection where the derivative is also zero. Saddle points are very common for loss functions of deep learning models and need to be handled carefully during optimization.
The Gradient

For a multivariate scalar-valued function $f(\boldsymbol{x})$ with $\boldsymbol{x} = [x_1, x_2, \ldots, x_D]$, the gradient collects all the partial derivatives:

$$\nabla f(\boldsymbol{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_D} \right]$$

Each element in this gradient vector tells us how much $f$ will change if we move a little along the corresponding $x_i$ (akin to the one-dimensional case).
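This "move a little along each coordinate" view can be made concrete with finite differences. The sketch below is an added illustration, not from the slides; the helper `numerical_gradient` and the toy function are made up for the example.

```python
# A minimal sketch (not from the slides): approximating the gradient of a
# multivariate function by a central finite difference along each coordinate.
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Estimate the gradient of f at x, one coordinate at a time."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

# Toy example: f(x) = x1^2 + 3*x2, whose exact gradient is [2*x1, 3]
f = lambda x: x[0]**2 + 3 * x[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))  # approx. [2., 3.]
```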
The Hessian

For a multivariate scalar-valued function $f(\boldsymbol{x})$, the Hessian is a $D \times D$ matrix of second-order partial derivatives:

$$\nabla^2 f(\boldsymbol{x}) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_D} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_D} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_D \partial x_1} & \frac{\partial^2 f}{\partial x_D \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_D^2}
\end{bmatrix}$$

Note: if the function itself is vector-valued, then we will have one such Hessian matrix for each output dimension of the function.

The Hessian gives information about the curvature of the function at the point $\boldsymbol{x}$.

A square, symmetric matrix $\boldsymbol{M}$ is positive semi-definite (PSD) if $\boldsymbol{z}^\top \boldsymbol{M} \boldsymbol{z} \geq 0$ for all $\boldsymbol{z}$, which holds if all its eigenvalues are non-negative. It will be negative semi-definite (NSD) if all its eigenvalues are non-positive.

The Hessian matrix can be used to assess the optima/saddle points:
If $\nabla f(\boldsymbol{x}) = 0$ and $\nabla^2 f(\boldsymbol{x})$ is a positive semi-definite (PSD) matrix, then $\boldsymbol{x}$ is a minima.
If $\nabla f(\boldsymbol{x}) = 0$ and $\nabla^2 f(\boldsymbol{x})$ is a negative semi-definite (NSD) matrix, then $\boldsymbol{x}$ is a maxima.
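The eigenvalue test above is easy to run numerically. The sketch below is an added illustration, not from the slides; the toy Hessians come from two hypothetical functions with known curvature.

```python
# A small sketch (not from the slides): classifying a stationary point from
# the eigenvalues of its Hessian matrix.
import numpy as np

def classify(hessian):
    """Classify a stationary point from the eigenvalues of its Hessian."""
    eig = np.linalg.eigvalsh(hessian)   # Hessian is symmetric
    if np.all(eig >= 0):
        return "minimum (Hessian is PSD)"
    if np.all(eig <= 0):
        return "maximum (Hessian is NSD)"
    return "saddle point (mixed-sign eigenvalues)"

# f(x1, x2) = x1^2 + x2^2 has Hessian [[2, 0], [0, 2]] -> minimum at (0, 0)
print(classify(np.array([[2.0, 0.0], [0.0, 2.0]])))

# f(x1, x2) = x1^2 - x2^2 has Hessian [[2, 0], [0, -2]] -> saddle at (0, 0)
print(classify(np.array([[2.0, 0.0], [0.0, -2.0]])))
```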
Convex and Non-Convex Functions
A function being optimized can be either convex or non-convex
[Figure: a couple of examples of convex functions]
Convex functions are bowl-shaped. They have a unique optima (minima): any local minimum is also the global minimum.
Convex Sets
A set $S$ of points is a convex set if, for any two points $\boldsymbol{x}, \boldsymbol{y} \in S$ and any $0 \leq \alpha \leq 1$,

$$\boldsymbol{z} = \alpha \boldsymbol{x} + (1 - \alpha)\, \boldsymbol{y} \in S$$

$\boldsymbol{z}$ is also called a "convex combination" of the two points. A convex combination of $N$ points $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N$ can be defined likewise as $\sum_{n=1}^{N} \alpha_n \boldsymbol{x}_n$ with $\alpha_n \geq 0$ and $\sum_{n=1}^{N} \alpha_n = 1$.

The above means that all points on the line segment between $\boldsymbol{x}$ and $\boldsymbol{y}$ lie within $S$.
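Convex functions and convex combinations are tied together by the inequality $f(\alpha \boldsymbol{x} + (1-\alpha)\boldsymbol{y}) \leq \alpha f(\boldsymbol{x}) + (1-\alpha) f(\boldsymbol{y})$, which is what gives convex functions their bowl shape. Below is a rough numerical sanity check of this inequality on randomly sampled points (an added sketch, not from the slides; the helper `seems_convex` is made up for illustration).

```python
# A rough numerical check (not from the slides) of the convexity inequality
# f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) on randomly sampled points.
import numpy as np

def seems_convex(f, n_trials=10000, low=-5.0, high=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, n_trials)
    y = rng.uniform(low, high, n_trials)
    a = rng.uniform(0.0, 1.0, n_trials)
    lhs = f(a * x + (1 - a) * y)
    rhs = a * f(x) + (1 - a) * f(y)
    return bool(np.all(lhs <= rhs + 1e-9))

print(seems_convex(lambda x: x**2))  # True: a bowl-shaped (convex) function
print(seems_convex(np.sin))          # False: sin is not convex on [-5, 5]
```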
Optimization Using First-Order Optimality
Very simple. We already used this approach for linear and ridge regression.

It is called "first order" since only the gradient is used, and the gradient provides the first-order information about the function being optimized. The approach works only for very simple problems where the objective is convex and there are no constraints on the values $\boldsymbol{w}$ can take.

First-order optimality: the gradient must be equal to zero at the optima,

$$\nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = 0$$

Sometimes, setting $\nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = 0$ and solving for $\boldsymbol{w}$ gives a closed-form solution (see the sketch below). If a closed-form solution is not available, the gradient vector can still be used in iterative optimization algorithms, like gradient descent.
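As a concrete instance of the closed-form route, here is a small NumPy sketch (an added example, not from the slides; the synthetic data and variable names are made up). Setting the gradient of the least squares loss $\lVert \boldsymbol{y} - \boldsymbol{X}\boldsymbol{w} \rVert^2$ to zero gives the normal equations $\boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{w} = \boldsymbol{X}^\top \boldsymbol{y}$, which are solved directly.

```python
# A small sketch (not from the slides): first-order optimality for least
# squares linear regression on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed-form solution from the first-order optimality condition
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # should be close to w_true
```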
Optimization via Gradient Descent

Gradient descent (GD) is iterative: it requires several steps/iterations to find the optimal solution. Fact: the gradient gives the direction of steepest change in the function’s value, so GD will move in the direction opposite to the gradient. (Can this approach be used for maximization problems? Yes: for maximization problems we can use gradient ascent, which moves in the direction of the gradient.) For convex functions, GD will converge to the global minima; for non-convex functions, good initialization is needed.

Gradient Descent
1. Initialize $\boldsymbol{w}$ as $\boldsymbol{w}^{(0)}$
2. For iteration $t = 0, 1, 2, \ldots$ (or until convergence):
   - Calculate the gradient $\boldsymbol{g}^{(t)}$ using the current iterate $\boldsymbol{w}^{(t)}$
   - Set the learning rate $\eta_t$
   - Move in the opposite direction of the gradient:
     $$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t\, \boldsymbol{g}^{(t)}$$

The learning rate $\eta_t$ is very important and should be set carefully (fixed or chosen adaptively); we will discuss some strategies later. The justification for moving against the gradient will be seen shortly. Sometimes it may be tricky to assess convergence; we will see some methods for that later as well. A minimal sketch of this loop follows.
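The sketch below is an added illustration, not from the slides; the function names, the fixed learning rate, and the toy objective are made up. It follows the loop above and uses the gradient norm as one possible convergence check.

```python
# A minimal sketch (not from the slides) of the gradient descent loop above,
# with a fixed learning rate and a simple gradient-norm convergence check.
import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, max_iters=1000, tol=1e-6):
    """Minimize a function given its gradient grad_fn, starting from w0."""
    w = np.asarray(w0, dtype=float)
    for t in range(max_iters):
        g = grad_fn(w)                 # gradient at the current iterate
        if np.linalg.norm(g) < tol:    # one possible convergence criterion
            break
        w = w - lr * g                 # move opposite to the gradient
    return w

# Toy example: minimize f(w) = (w1 - 3)^2 + (w2 + 1)^2, gradient 2*(w - [3, -1])
grad = lambda w: 2 * (w - np.array([3.0, -1.0]))
print(gradient_descent(grad, w0=[0.0, 0.0]))  # approx. [3., -1.]
```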
Gradient Descent: An Illustration

[Figure: gradient descent on a one-dimensional loss $L(\boldsymbol{w})$. Where the gradient is negative, the update moves $\boldsymbol{w}$ in the positive direction; where the gradient is positive, the update moves in the negative direction. The iterates $\boldsymbol{w}^{(0)}, \boldsymbol{w}^{(1)}, \boldsymbol{w}^{(2)}, \boldsymbol{w}^{(3)}, \ldots$ approach the optimum $\boldsymbol{w}^*$. The learning rate is very important. On a non-convex loss, GD can get stuck at a local minima, so good initialization is very important.]
GD: An Example
Let’s apply GD to least squares linear regression, with loss

$$L(\boldsymbol{w}) = \sum_{n=1}^{N} \left( y_n - \boldsymbol{w}^\top \boldsymbol{x}_n \right)^2$$

The gradient:

$$\boldsymbol{g} = \nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = -2 \sum_{n=1}^{N} \left( y_n - \boldsymbol{w}^\top \boldsymbol{x}_n \right) \boldsymbol{x}_n$$

Each GD update will be of the form (absorbing the constant factor into the learning rate)

$$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} + \eta_t \sum_{n=1}^{N} \left( y_n - \boldsymbol{w}^{(t)\top} \boldsymbol{x}_n \right) \boldsymbol{x}_n$$

Here $\left( y_n - \boldsymbol{w}^{(t)\top} \boldsymbol{x}_n \right)$ is the prediction error of the current model on the $n$-th training example: training examples on which the current model’s error is large contribute more to the update.
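Putting the pieces together, here is a small NumPy sketch of this update on synthetic data (an added example, not from the slides; the data, learning rate, and iteration count are made up for illustration).

```python
# A small sketch (not from the slides): gradient descent for least squares
# linear regression, using the error-weighted update rule above.
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(D)          # initialization w^(0)
lr = 0.001               # fixed learning rate (could also be chosen adaptively)
for t in range(500):
    errors = y - X @ w           # prediction errors of the current model
    w = w + lr * (X.T @ errors)  # examples with large error contribute more

print(w)  # should be close to w_true
```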