Optimization for ML
Constrained Optimization
Method 3: Creating a Dual Problem
Suppose we wish to solve
$$\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x}) \quad \text{s.t.} \quad g_c(\mathbf{x}) \le 0 \text{ for all } c = 1, \dots, C$$
(Let us see how to handle multiple constraints; equality constraints can be handled similarly.)
Trick: sneak these constraints into the objective. Construct a barrier (indicator) fn $h$ so that $h(\mathbf{x}) = 0$ if all constraints are satisfied and $h(\mathbf{x}) = \infty$ otherwise, and simply solve
$$\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x}) + h(\mathbf{x})$$
Easy to see that both problems have the same solution.
One very elegant way to construct such a barrier is the following
$$\min_{\mathbf{x} \in \mathbb{R}^d} \; \max_{\substack{\boldsymbol{\alpha} \in \mathbb{R}^C \\ \alpha_c \ge 0}} \; \left\{ f(\mathbf{x}) + \sum_{c=1}^{C} \alpha_c \cdot g_c(\mathbf{x}) \right\}$$
(Hmm … we still have a constraint here, but a very simple one, i.e. $\alpha_c \ge 0$.)
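To see why the inner maximization implements the barrier $h$ from above, note that for any fixed $\mathbf{x}$:
$$\max_{\boldsymbol{\alpha} \ge 0} \; \sum_{c=1}^{C} \alpha_c\, g_c(\mathbf{x}) = \begin{cases} 0 & \text{if } g_c(\mathbf{x}) \le 0 \text{ for all } c \quad (\text{best to set every } \alpha_c = 0) \\ +\infty & \text{if } g_c(\mathbf{x}) > 0 \text{ for some } c \quad (\text{send that } \alpha_c \to \infty) \end{cases}$$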
The Dual Problem
The original optimization problem is also called the primal problem
Recall: variables of the original problem (here $\mathbf{x}$) are called primal variables; the newly introduced variables $\boldsymbol{\alpha}$ are called dual variables
Using the Lagrangian $\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}) = f(\mathbf{x}) + \sum_{c=1}^{C} \alpha_c\, g_c(\mathbf{x})$, we rewrote the primal problem as $\min_{\mathbf{x}} \max_{\boldsymbol{\alpha} \ge 0} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})$
In some cases, the dual problem is easier to solve than the primal
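The dual problem is obtained by swapping the order of min and max (standard definitions; by weak duality, the dual optimal value never exceeds the primal optimal value):
$$\text{Primal:} \;\; \min_{\mathbf{x}} \max_{\boldsymbol{\alpha} \ge 0} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}) \qquad\qquad \text{Dual:} \;\; \max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})$$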
Duality
Let $\hat{\mathbf{x}}$ and $\hat{\boldsymbol{\alpha}}$ be the solutions to the primal and dual problems respectively, i.e. $\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \max_{\boldsymbol{\alpha} \ge 0} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})$ and $\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})$
For the SVM, the dual is actually the problem several solvers (e.g. libsvm, sklearn) solve
Support Vectors
Recall that for the CSVM, the primal and dual solutions satisfy $\hat{\mathbf{w}} = \sum_{i=1}^{n} \hat{\alpha}_i y_i \mathbf{x}_i$, s.t. $0 \le \hat{\alpha}_i \le C$ for all $i$, and $\sum_{i=1}^{n} \hat{\alpha}_i y_i = 0$. Data points with $\hat{\alpha}_i > 0$ are called support vectors.
Reason for the name “SVM”: imagine that each data point $\mathbf{x}_i$ is applying a force $F_i$ on the hyperplane in the direction $\hat{\alpha}_i y_i \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|_2}$ (perpendicular to the hyperplane)
Then the total force on the hyperplane is equal to zero, since $\sum_i \hat{\alpha}_i y_i = 0$
Also, the condition $\hat{\mathbf{w}} = \sum_i \hat{\alpha}_i y_i \mathbf{x}_i$ can be interpreted to mean that the total torque on the hyperplane is zero as well
Thus, support vectors mechanically support the hyperplane (don’t let it shift or rotate around), hence their name
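A quick sanity check of the two equilibrium conditions in this classical mechanical analogy (our notation, with $\times$ the cross product and $F_i = \hat{\alpha}_i y_i \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|_2}$ applied at the point $\mathbf{x}_i$):
$$\sum_i F_i = \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|_2} \sum_i \hat{\alpha}_i y_i = 0, \qquad \sum_i \mathbf{x}_i \times F_i = \left( \sum_i \hat{\alpha}_i y_i \mathbf{x}_i \right) \times \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|_2} = \hat{\mathbf{w}} \times \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|_2} = 0$$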
CSVM Dual Problem
If we have a bias $b$, then the dual problem looks like
$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \left\| \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \right\|_2^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C \text{ for all } i, \text{ and } \sum_{i=1}^{n} \alpha_i y_i = 0$$
The constraint $\sum_{i} \alpha_i y_i = 0$ links all the $\alpha_i$ together: we cannot update a single $\alpha_i$ without disturbing all the others
A more involved algorithm, Sequential Minimal Optimization (SMO) by John Platt, is needed to solve the version with a bias – it updates two $\alpha_i$ at a time!
However, if we omit the bias (hide it inside the model vector as an extra feature), the dual is
$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \left\| \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \right\|_2^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C \text{ for all } i$$
We will see a method to solve this simpler version of the problem
Solvers for the SVM Problem
We can solve the SVM (no bias) either by solving the primal version
$$\min_{\mathbf{w} \in \mathbb{R}^d} \; \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^{n} \left[1 - y_i \langle \mathbf{w}, \mathbf{x}_i \rangle\right]_+$$
(sub-gradient methods apply here, since the primal objective is convex but non-differentiable)
… or the dual version
$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \left\| \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \right\|_2^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C \text{ for all } i$$
(projected methods apply here, since we have a constraint – albeit a simple one – in the dual)
We may use gradient, coordinate, etc. methods to solve either
For the primal, we may use sub-gradient descent, coordinate descent, etc.
For the dual, we may use (projected) gradient ascent, coordinate ascent, etc. (a sketch of the projected route follows below)
Does this mean I need to choose one data point at each time step? Indeed: coordinate ascent in the dual looks a lot like stochastic gradient descent in the primal! Both work with a single data point at a time
We will actually see how to do coordinate maximization for the dual. Since the optimization variable in the dual is $\boldsymbol{\alpha} \in \mathbb{R}^n$, we will need to take one coordinate at a time, i.e. choose a different $\alpha_i$ at each time step
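To make the dual route concrete, here is a minimal NumPy sketch of projected gradient ascent on the no-bias dual; the projection onto the box $[0, C]^n$ is just a clip. (All names, the fixed step length, and iteration count are our own illustrative choices; labels are assumed to be $\pm 1$.)

```python
import numpy as np

def svm_dual_pga(X, y, C=1.0, eta=0.01, iters=1000):
    """Projected gradient ascent on the no-bias CSVM dual (illustrative sketch)."""
    n, d = X.shape
    alpha = np.zeros(n)
    for _ in range(iters):
        w = X.T @ (alpha * y)                         # w = sum_i alpha_i y_i x_i
        grad = 1.0 - y * (X @ w)                      # df/d(alpha_i) = 1 - y_i <w, x_i>
        alpha = np.clip(alpha + eta * grad, 0.0, C)   # ascent step, then project onto [0, C]^n
    return X.T @ (alpha * y)                          # recover the primal model vector w
```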
SDCM for the CSVM Problem
(Warning: in general, finding an unconstrained solution and doing a projection step does not give a true solution – we will see below why it is valid here.)
We wish to solve
$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \left\| \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \right\|_2^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C \text{ for all } i$$
Concentrating on just the terms that involve $\alpha_i$ (write $\mathbf{w}_{-i} = \sum_{j \neq i} \alpha_j y_j \mathbf{x}_j$), we get
$$\max_{\alpha_i} \; \alpha_i \left( 1 - y_i \langle \mathbf{w}_{-i}, \mathbf{x}_i \rangle \right) - \frac{\alpha_i^2}{2} \|\mathbf{x}_i\|_2^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C$$
Renaming $p = \|\mathbf{x}_i\|_2^2$ and $q = 1 - y_i \langle \mathbf{w}_{-i}, \mathbf{x}_i \rangle$, we get
$$\max_{\alpha_i} \; q\,\alpha_i - \frac{p}{2}\,\alpha_i^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C$$
Solution is very simple: find the unrestricted maximum, i.e. $\tilde{\alpha}_i = q/p$, then clip it to the box: if $\tilde{\alpha}_i < 0$, the solution is $0$; elif $\tilde{\alpha}_i > C$, the solution is $C$; else it is $\tilde{\alpha}_i$
In this special case, our objective has a nice property called unimodality (it is a concave quadratic in the single variable $\alpha_i$), which is why this clipping trick works – it won’t work in general
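In other words, for a unimodal (here, concave quadratic) one-dimensional objective with $p = \|\mathbf{x}_i\|_2^2 > 0$, projecting the unrestricted maximizer onto the feasible interval recovers the exact constrained solution:
$$\arg\max_{\alpha_i \in [0, C]} \left( q\,\alpha_i - \frac{p}{2}\,\alpha_i^2 \right) = \min\left\{ \max\left\{ \frac{q}{p},\, 0 \right\},\, C \right\} = \Pi_{[0,C]}\!\left( \frac{q}{p} \right)$$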
Speeding up SDCM Computations
All that is left is to figure out how to compute $p$ and $q$ for our chosen coordinate $i$
$p = \|\mathbf{x}_i\|_2^2$ can be easily precomputed for all data points
However, $q = 1 - y_i \langle \mathbf{w}_{-i}, \mathbf{x}_i \rangle$ needs $O(nd)$ time to compute …
… only if done naively. Recall that we always have $\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i$ for the CSVM (even if we have bias and slack variables)
Thus, $\mathbf{w}_{-i} = \mathbf{w} - \alpha_i y_i \mathbf{x}_i$
If we somehow had access to $\mathbf{w}$, then computing $\mathbf{w}_{-i}$ would take $O(d)$ time and computing $q$ would take $O(d)$ time
All we need to do is create (and update) the vector $\mathbf{w}$ in addition to the $\boldsymbol{\alpha}$ vector, and we would be able to find $q$ in just $O(d)$ time
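Putting the pieces together, here is a minimal NumPy sketch of SDCM with the maintained $\mathbf{w}$ vector, so each update costs $O(d)$. (Function and variable names are our own; labels are assumed to be $\pm 1$ and data points nonzero.)

```python
import numpy as np

def svm_sdcm(X, y, C=1.0, epochs=10, seed=0):
    """SDCM for the no-bias CSVM: O(d) per update (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                      # maintained invariant: w = sum_i alpha_i y_i x_i
    p = np.einsum('ij,ij->i', X, X)      # p_i = ||x_i||^2, precomputed once for all points
    for _ in range(epochs * n):
        i = rng.integers(n)              # pick a random coordinate alpha_i
        # q = 1 - y_i <w_{-i}, x_i>, using w_{-i} = w - alpha_i y_i x_i and y_i^2 = 1
        q = 1.0 - y[i] * (X[i] @ w) + alpha[i] * p[i]
        a_new = np.clip(q / p[i], 0.0, C)             # clipped 1-D maximizer
        w += (a_new - alpha[i]) * y[i] * X[i]         # O(d) update restores the invariant
        alpha[i] = a_new
    return w, alpha
```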
Which Method to Choose?
(Exercise: can you work out the details of how to implement stochastic primal coordinate descent in $O(n)$ time per update?)
Be careful not to get confused with similar-sounding terms. Coordinate ascent takes a small step along one of the coordinates to increase the objective a bit; coordinate maximization instead tries to completely maximize the objective along that coordinate. Also be careful that some books/papers may call a method “coordinate ascent” even when it is really doing coordinate maximization – the terminology is unfortunately a bit non-standard.
Gradient Methods
Primal Gradient Descent: $O(nd)$ time per update
Dual Gradient Ascent: $O(nd)$ time per update
Stochastic Gradient Methods
Stochastic Primal Gradient Descent: $O(d)$ time per update
Stochastic Dual Gradient Ascent: $O(n)$ time per update
Coordinate Methods (take $O(nd)$ time per update if done naively)
Stochastic Primal Coordinate Descent: $O(n)$ time per update
Stochastic Dual Coordinate Maximization: $O(d)$ time per update
Case 1: $n \gg d$: use SDCM or SPGD ($O(d)$ time per update)
Case 2: $d \gg n$: use SDGA or SPCD ($O(n)$ time per update)
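For instance (hypothetical sizes), with $n = 10^6$ data points and $d = 10^3$ features we are in Case 1: one SDCM or SPGD update costs $O(d) \approx 10^3$ operations, versus $O(nd) \approx 10^9$ for a full (sub-)gradient update – a millionfold saving per update, at the price of noisier individual updates.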
Practical Issues with GD Variants
How to initialize?
How to decide convergence?
How to decide step lengths?
How to Initialize?
Initializing close to the global optimum is obviously preferable
Easier said than done. In some applications, however, we may have such an initialization available, e.g. someone may have a model they trained on different data
For convex functions, bad initialization may mean slow convergence,
but if step lengths are nice then GD should converge eventually
For non-convex functions (e.g. while training deepnets), bad
initialization may mean getting stuck at a very bad saddle point
Random restarts are the most common solution to overcome this problem
For some nice non-convex problems, we do know very good ways to
provably initialize close to the global optimum (e.g. collaborative
filtering in recommendation systems) – details beyond scope of CS771
How to decide Convergence?
In optimization, convergence can refer to a couple of things
The algorithm has gotten within a “small” distance of a global/local optimum
The algorithm is not making “much” progress, e.g. successive iterates or objective values are barely changing
GD stops making progress when it reaches a stationary point, i.e. it can stop making progress even without having reached a global optimum (e.g. if it has reached a saddle point)
Usually a few heuristics are used to decide when to stop executing GD
If gradient vectors have become too “small”, or “not much” progress is being made, or if the objective function value is already acceptably “small”, or if the assignment submission deadline is 5 minutes away
Acceptable levels, e.g. “small”, “not much”, are usually decided either by consulting domain experts or else by using performance on validation sets
How to detect Convergence?
Method 1: Tolerance technique
For a pre-decided tolerance value $\epsilon > 0$, if $\|\mathbf{x}^{t+1} - \mathbf{x}^{t}\|_2 < \epsilon$, stop!
Method 2: Zero-th order technique
If fn value has not changed much, stop (or else tune the learning rate)!
$\left| f(\mathbf{x}^{t+1}) - f(\mathbf{x}^{t}) \right| < \epsilon$ or $\frac{\left| f(\mathbf{x}^{t+1}) - f(\mathbf{x}^{t}) \right|}{\left| f(\mathbf{x}^{t}) \right|} < \epsilon$
Method 3: First order technique
If gradient has become too small, i.e. $\|\nabla f(\mathbf{x}^{t})\|_2 < \epsilon$, stop!
Method 4: Cross validation technique
Test the current model on validation data – if performance is acceptable, stop!
Other techniques, e.g. primal-dual techniques, are usually infeasible for large-scale ML problems and hence not used to decide convergence
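A minimal sketch combining Methods 1–3 inside a GD loop (the threshold, step length, and all names are our own illustrative choices):

```python
import numpy as np

def gd_with_stopping(f, grad_f, x0, eta=0.1, eps=1e-6, max_iters=10_000):
    """Gradient descent with the convergence heuristics above (illustrative sketch)."""
    x = x0
    for _ in range(max_iters):
        x_new = x - eta * grad_f(x)
        if np.linalg.norm(x_new - x) < eps:        # Method 1: iterates barely move
            return x_new
        if abs(f(x_new) - f(x)) < eps:             # Method 2: fn value barely changes
            return x_new
        if np.linalg.norm(grad_f(x_new)) < eps:    # Method 3: gradient is nearly zero
            return x_new
        x = x_new
    return x

# Usage on a toy quadratic: converges toward the origin
x_hat = gd_with_stopping(lambda x: x @ x, lambda x: 2 * x, np.ones(5))
```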
How to choose Step Length?
For “nicely behaved” convex functions, we have formulae for the step length
Set $\eta_t = \frac{\eta}{\sqrt{t}}$ or else $\eta_t = \frac{\eta}{t}$, where $\eta > 0$ is a hyperparameter
Basic idea is to choose step lengths satisfying $\eta_t \to 0$ (diminishing) and $\sum_t \eta_t = \infty$ (infinite travel)
Simple, and for “nice” convex functions this gives $\epsilon$-convergence in just $O(1/\epsilon^2)$ steps
Details (e.g. what is “nice”) beyond scope of CS771 (see CS77X, X = 3,4,7)
A powerful but expensive technique is the Newton method, which uses second-order information to choose step lengths (and directions)
Another option is line search, e.g. the Armijo rule: start by using a decently large value of $\eta$; if the objective function value does not reduce sufficiently, then reduce $\eta$ and try again
Line search can be expensive as it involves multiple fn evaluations per GD step
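A minimal sketch of backtracking in the spirit of the Armijo rule above ($\eta_0$, $\beta$, $c$ and all names are illustrative choices of our own):

```python
import numpy as np

def armijo_step(f, grad_f, x, eta0=1.0, beta=0.5, c=1e-4):
    """One GD step with backtracking line search (illustrative sketch)."""
    g = grad_f(x)
    eta = eta0
    # Shrink eta until the objective decreases "sufficiently" (Armijo condition)
    while f(x - eta * g) > f(x) - c * eta * (g @ g):
        eta *= beta
    return x - eta * g
```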
Cheaper “adaptive” techniques exist – these employ several tricks
Use a different step length for each dimension of $\mathbf{x}$ (e.g. Adagrad), where the scalar $\eta$ is replaced with a diagonal matrix $D_t$, i.e. $\mathbf{x}^{t+1} = \mathbf{x}^{t} - D_t\, \nabla f(\mathbf{x}^{t})$
Use “momentum” methods (e.g. NAG, Adam) which essentially infuse previous gradients into the current gradient
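For concreteness, one common form of the momentum update (our notation, with momentum parameter $\beta \in [0, 1)$), which makes the effective search direction an exponentially weighted average of past gradients:
$$\mathbf{v}^{t+1} = \beta\, \mathbf{v}^{t} + \nabla f(\mathbf{x}^{t}), \qquad \mathbf{x}^{t+1} = \mathbf{x}^{t} - \eta\, \mathbf{v}^{t+1}$$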