
Optimization for ML
Constrained Optimization 2
My first Solver

Method 3: Creating a Dual Problem
Let us see how to handle multiple constraints and equality constraints.

Suppose we wish to solve

$\min_{\mathbf x \in \mathbb R^d} f(\mathbf x)$ s.t. $g(\mathbf x) \le 0$

Trick: sneak this constraint into the objective
Construct a barrier (indicator) fn $\mathbb 1(\mathbf x)$ so that $\mathbb 1(\mathbf x) = 0$ if $g(\mathbf x) \le 0$ and $\infty$ otherwise, and simply solve

$\min_{\mathbf x \in \mathbb R^d} f(\mathbf x) + \mathbb 1(\mathbf x)$

Easy to see that both problems have the same solution
One very elegant way to construct such a barrier is the following

$\mathbb 1(\mathbf x) = \max_{\alpha \ge 0}\ \alpha \cdot g(\mathbf x)$

Hmm … we still have a constraint here, but a very simple one i.e. $\alpha \ge 0$
Thus, we want to solve

$\min_{\mathbf x \in \mathbb R^d} \left\{ f(\mathbf x) + \max_{\alpha \ge 0}\ \alpha \cdot g(\mathbf x) \right\}$

Same as

$\min_{\mathbf x \in \mathbb R^d} \max_{\alpha \ge 0} \left\{ f(\mathbf x) + \alpha \cdot g(\mathbf x) \right\}$
A few Cleanup Steps

Step 1: Convert your problem to a minimization problem
Step 2: Convert all inequality constraints to $\le 0$ constraints
Step 3: Convert all equality constraints to two inequality constraints (e.g. $h(\mathbf x) = 0$ becomes $h(\mathbf x) \le 0$ and $-h(\mathbf x) \le 0$)
Step 4: For each constraint we now have, introduce a new variable

e.g. if we have $C$ inequality constraints $g_c(\mathbf x) \le 0$, introduce $C$ new variables $\alpha_c$
These new variables are called dual variables, or sometimes even Lagrange multipliers
The variables of the original optimization problem, e.g. $\mathbf x$ in this case, are called the primal variables by comparison
The Lagrangian

$\mathcal L(\mathbf x, \boldsymbol\alpha) \triangleq f(\mathbf x) + \sum_{c=1}^C \alpha_c \cdot g_c(\mathbf x)$ is called the Lagrangian of the problem

$\min_{\mathbf x \in \mathbb R^d} f(\mathbf x)$ s.t. $g_c(\mathbf x) \le 0$ for all $c \in [C]$

If $\mathbf x$ violates even one constraint, we have $\max_{\alpha_c \ge 0} \mathcal L(\mathbf x, \boldsymbol\alpha) = \infty$
If $\mathbf x$ satisfies every single constraint, we have $\max_{\alpha_c \ge 0} \mathcal L(\mathbf x, \boldsymbol\alpha) = f(\mathbf x)$
This is just a nice way of rewriting the above problem:

$\min_{\mathbf x \in \mathbb R^d} \max_{\substack{\boldsymbol\alpha \in \mathbb R^C \\ \alpha_c \ge 0}} \left\{ f(\mathbf x) + \sum_{c=1}^C \alpha_c \cdot g_c(\mathbf x) \right\}$
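
A quick numerical sanity check of this min-max rewriting (a hedged sketch on a made-up toy problem, not part of the original slides): we approximate the inner maximization over $\alpha$ with a large finite grid and recover the constrained optimum.

```python
import numpy as np

# Toy check (invented example): minimize f(x) = (x-2)^2 s.t. g(x) = x-1 <= 0.
# The constrained optimum is x = 1 with f(1) = 1. We approximate
# min_x max_{alpha >= 0} { f(x) + alpha * g(x) } on a grid, capping alpha.
f = lambda x: (x - 2.0) ** 2
g = lambda x: x - 1.0

xs = np.linspace(-2.0, 4.0, 1201)
alphas = np.linspace(0.0, 1000.0, 1001)  # large cap stands in for alpha -> inf

L = f(xs)[:, None] + alphas[None, :] * g(xs)[:, None]  # Lagrangian values
inner_max = L.max(axis=1)  # ~f(x) where g(x) <= 0, huge where g(x) > 0

x_star = xs[inner_max.argmin()]
print(x_star, f(x_star))  # ~1.0 and ~1.0: matches the constrained optimum
```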
The Dual Problem

The original optimization problem is also called the primal problem
Recall: variables of the original problem, e.g. $\mathbf x$, are called primal variables
Using the Lagrangian, we rewrote the primal problem as

$\min_{\mathbf x \in \mathbb R^d} \max_{\boldsymbol\alpha \ge 0} \mathcal L(\mathbf x, \boldsymbol\alpha)$

The dual problem is obtained by simply switching the order of min/max

$\max_{\boldsymbol\alpha \ge 0} \min_{\mathbf x \in \mathbb R^d} \mathcal L(\mathbf x, \boldsymbol\alpha)$

In some cases, the dual problem is easier to solve than the primal
Duality

Let $\mathbf x^\ast$ be the solution to the primal problem i.e. $\mathbf x^\ast = \arg\min_{\mathbf x} \max_{\boldsymbol\alpha \ge 0} \mathcal L(\mathbf x, \boldsymbol\alpha)$
Let $\boldsymbol\alpha^\ast$ be the solution to the dual problem i.e. $\boldsymbol\alpha^\ast = \arg\max_{\boldsymbol\alpha \ge 0} \min_{\mathbf x} \mathcal L(\mathbf x, \boldsymbol\alpha)$
Strong Duality: if the original problem is convex and "nice", the two problems have the same optimal value i.e. $\mathcal L(\mathbf x^\ast, \boldsymbol\alpha^\ast) = f(\mathbf x^\ast)$
Complementary Slackness: $\alpha^\ast_c \cdot g_c(\mathbf x^\ast) = 0$ for all constraints $c \in [C]$
Note: not complimentary but complementary 🙂
Hard SVM without a bias

$\min_{\mathbf w \in \mathbb R^d} \frac12 \|\mathbf w\|_2^2$ such that $y_i \cdot \mathbf w^\top \mathbf x_i \ge 1$ for all $i \in [n]$
We have $n$ constraints so we need $n$ dual variables i.e. $\boldsymbol\alpha \in \mathbb R^n$ with $\alpha_i \ge 0$
Lagrangian: $\mathcal L(\mathbf w, \boldsymbol\alpha) = \frac12 \|\mathbf w\|_2^2 + \sum_i \alpha_i (1 - y_i \cdot \mathbf w^\top \mathbf x_i)$
Primal problem: $\min_{\mathbf w} \max_{\boldsymbol\alpha \ge 0} \mathcal L(\mathbf w, \boldsymbol\alpha)$
Dual problem: $\max_{\boldsymbol\alpha \ge 0} \min_{\mathbf w} \mathcal L(\mathbf w, \boldsymbol\alpha)$
The dual problem can be greatly simplified!
Simplifying the Dual Problem

Once you get optimal values of $\boldsymbol\alpha$, use $\mathbf w = \sum_i \alpha_i y_i \mathbf x_i$ to get the optimal value of $\mathbf w$
Note that the inner problem in the dual problem is

$\min_{\mathbf w \in \mathbb R^d} \frac12 \|\mathbf w\|_2^2 + \sum_i \alpha_i (1 - y_i \cdot \mathbf w^\top \mathbf x_i)$

Since this is an unconstrained problem with a convex and differentiable objective, we can apply first order optimality to solve it completely 🙂
If we set the gradient to zero, we will get $\mathbf w = \sum_i \alpha_i y_i \mathbf x_i$
Substituting this back in the dual problem we get

$\max_{\boldsymbol\alpha \ge 0} \sum_i \alpha_i - \frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf x_i^\top \mathbf x_j$

This is actually the problem several solvers (e.g. libsvm, sklearn) solve
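
For instance, here is a hedged sketch of how scikit-learn exposes this (the toy data is invented for illustration; note that sklearn's SVC solves the soft-margin variant with a bias, discussed on the next slides):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data, invented for illustration only
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)  # libsvm solves the dual inside

# dual_coef_ stores alpha_i * y_i for the points with alpha_i > 0
print("support vector indices:", clf.support_)

# Recover w = sum_i alpha_i y_i x_i and compare with the fitted coefficients
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))  # True: primal solution built from dual vars
```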
Support Vectors!!

Recall: we have one dual variable $\alpha_i \ge 0$ for every data point
After solving the dual problem, the data points for which $\alpha_i > 0$ are called Support Vectors
Usually we have far fewer support vectors than data points
Recall: complementary slackness tells us that $\alpha_i \cdot (1 - y_i \cdot \mathbf w^\top \mathbf x_i) = 0$ i.e. only those data points can become SVs for which $y_i \cdot \mathbf w^\top \mathbf x_i = 1$ i.e. those at the margin
The reason these are called support vectors has to do with a mechanical interpretation of these objects – need to look at the CSVM to understand that
Dual for CSVM

Similar calculations (see course notes for a derivation) show that if we have a bias term as well as slack variables, then the dual looks like

$\max_{\boldsymbol\alpha} \sum_i \alpha_i - \frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf x_i^\top \mathbf x_j$
s.t. $0 \le \alpha_i \le C$ for all $i$, and $\sum_i \alpha_i y_i = 0$

Reason for the name "SVM": imagine that each data point is applying a force of magnitude $\alpha_i$ on the hyperplane in the direction $y_i$
Then the total force on the hyperplane is equal to zero since $\sum_i \alpha_i y_i = 0$
Also, the condition $\mathbf w = \sum_i \alpha_i y_i \mathbf x_i$ can be interpreted to mean that the total torque on the hyperplane is zero as well
Thus, support vectors mechanically support the hyperplane (don't let it shift or rotate around), hence their name 🙂
CSVM Dual Problem

If we have a bias $b$, then the dual problem looks like

$\max_{\boldsymbol\alpha} \sum_i \alpha_i - \frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf x_i^\top \mathbf x_j$
s.t. $0 \le \alpha_i \le C$ for all $i$, and $\sum_i \alpha_i y_i = 0$

The constraint $\sum_i \alpha_i y_i = 0$ links all the $\alpha_i$ together. Cannot update a single $\alpha_i$ without disturbing all the others 🙁
A more involved algorithm, Sequential Minimal Optimization (SMO) by John Platt, is needed to solve the version with a bias – it updates two $\alpha_i$ at a time!
However, if we omit the bias (hide it inside the model vector) the dual is

$\max_{\boldsymbol\alpha} \sum_i \alpha_i - \frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf x_i^\top \mathbf x_j$
s.t. $0 \le \alpha_i \le C$ for all $i$

We will see a method to solve this simpler version of the problem
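
Before that, here is a hedged sketch of the most direct approach to this simpler dual – projected gradient ascent, where projecting onto the box $[0, C]^n$ is just clipping (the function name and step size are my choices, not the course's reference code):

```python
import numpy as np

def dual_pga(X, y, C=1.0, eta=0.01, iters=1000):
    # Q_ij = y_i y_j <x_i, x_j>; the dual objective is 1'a - 0.5 a'Qa
    Yx = y[:, None] * X
    Q = Yx @ Yx.T
    alpha = np.zeros(len(y))
    for _ in range(iters):
        grad = 1.0 - Q @ alpha                       # gradient of the dual
        alpha = np.clip(alpha + eta * grad, 0.0, C)  # ascend, then project
    return alpha  # the primal model is then w = sum_i alpha_i y_i x_i
```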
Solvers for the SVM problem

We can solve the SVM (no bias) by either solving the primal version

$\min_{\mathbf w} \frac12 \|\mathbf w\|_2^2 + C \sum_i \max(0, 1 - y_i \cdot \mathbf w^\top \mathbf x_i)$

(sub-gradient methods apply here, since the primal objective is convex but non-differentiable)

… or the dual version

$\max_{\boldsymbol\alpha} \sum_i \alpha_i - \frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf x_i^\top \mathbf x_j$
s.t. $0 \le \alpha_i \le C$ for all $i$

(projected methods apply here, since we have a constraint – albeit a simple one – in the dual)

We may use gradient, coordinate etc methods to solve either
For primal, we may use sub-gradient descent, coordinate descent, etc
For dual, we may use (projected) gradient ascent, coordinate ascent
Does this mean I need to choose one data point at each time step?
Yes, coordinate ascent in the dual looks a lot like stochastic gradient descent in the primal! Both work with a single data point at a time
We will actually see how to do coordinate maximization for the dual
Since the optimization variable in the dual is $\boldsymbol\alpha$, we will need to take one coordinate at a time i.e. choose a different $\alpha_i$ at each time step
SDCM for the CSVM Problem

Warning: in general, finding an unconstrained solution and then doing a projection step does not give a true solution

$\max_{\boldsymbol\alpha} \sum_i \alpha_i - \frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf x_i^\top \mathbf x_j$, s.t. $0 \le \alpha_i \le C$ for all $i$

Concentrating on just the terms that involve $\alpha_i$ we get

$\max_{\alpha_i} \alpha_i (1 - y_i \cdot \mathbf x_i^\top \mathbf w_{-i}) - \frac12 \alpha_i^2 \|\mathbf x_i\|_2^2$, s.t. $0 \le \alpha_i \le C$, where $\mathbf w_{-i} = \sum_{j \ne i} \alpha_j y_j \mathbf x_j$

Renaming $p_i = 1 - y_i \cdot \mathbf x_i^\top \mathbf w_{-i}$ and $q_i = \|\mathbf x_i\|_2^2$, we get

$\max_{\alpha_i} p_i \alpha_i - \frac12 q_i \alpha_i^2$, s.t. $0 \le \alpha_i \le C$

Solution is very simple: find the unrestricted maximum i.e. $\hat\alpha_i = p_i / q_i$
If $\hat\alpha_i < 0$, solution is $0$; elif $\hat\alpha_i > C$, solution is $C$; else it is $\hat\alpha_i$
Indeed! In this special case, our objective had a nice property called unimodality, which is why this trick works – it won't work in general
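
In code, one such coordinate update might look like this naive $O(nd)$ sketch (the names p, q mirror the renaming above and are my labeling, not the slides'):

```python
import numpy as np

def sdcm_update(i, alpha, X, y, C):
    # w_{-i} = sum_{j != i} alpha_j y_j x_j, computed naively in O(nd) time
    w_mi = (alpha * y) @ X - alpha[i] * y[i] * X[i]
    p = 1.0 - y[i] * (X[i] @ w_mi)
    q = X[i] @ X[i]
    alpha[i] = np.clip(p / q, 0.0, C)  # unrestricted max p/q, clipped to [0, C]
    return alpha
```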
Speeding up SDCM computations

All that is left is to find how to compute $p_i$ and $q_i$ for our chosen coordinate $i$
$q_i = \|\mathbf x_i\|_2^2$ can be easily precomputed for all data points
However, $\mathbf w_{-i}$ needs $O(nd)$ time to compute 🙁
… only if done naively. Recall that we always have $\mathbf w = \sum_j \alpha_j y_j \mathbf x_j$ for the CSVM (even if we have bias and slack variables)
Thus, $\mathbf w_{-i} = \mathbf w - \alpha_i y_i \mathbf x_i$
If we somehow had access to $\mathbf w$, then computing $\mathbf w_{-i}$ would take $O(d)$ time and computing $p_i$ would take $O(d)$ time
All we need to do is create (and update) the vector $\mathbf w$ in addition to the vector $\boldsymbol\alpha$ and we would be able to find $p_i$ in just $O(d)$ time 🙂
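
Putting it together, a hedged sketch of the full $O(d)$-per-update SDCM loop (the hyperparameters and the random coordinate schedule are my choices):

```python
import numpy as np

def sdcm(X, y, C=1.0, epochs=10, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    alpha, w = np.zeros(n), np.zeros(d)
    q = np.einsum("ij,ij->i", X, X)        # precompute q_i = ||x_i||^2
    for _ in range(epochs * n):
        i = rng.integers(n)                # pick a coordinate alpha_i
        w_mi = w - alpha[i] * y[i] * X[i]  # O(d): remove point i's contribution
        alpha[i] = np.clip((1.0 - y[i] * (X[i] @ w_mi)) / q[i], 0.0, C)
        w = w_mi + alpha[i] * y[i] * X[i]  # O(d): add updated contribution back
    return w, alpha
```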
Which Method to Choose?

Can you work out the details of how to implement stochastic primal coordinate descent in $O(n)$ time per update?

Be careful not to get confused with similar sounding terms. Coordinate Ascent takes a small step along one of the coordinates to increase the objective a bit. Coordinate Maximization instead tries to completely maximize the objective along a coordinate. Also be careful that some books/papers may call a method "Coordinate Ascent" even when it is really doing Coordinate Maximization. The terminology is unfortunately a bit non-standard.

Gradient Methods
Primal Gradient Descent: $O(nd)$ time per update
Dual Gradient Ascent: $O(nd)$ time per update
Stochastic Gradient Methods
Stochastic Primal Gradient Descent: $O(d)$ time per update
Stochastic Dual Gradient Ascent: $O(n)$ time per update
Coordinate Methods (take $O(nd)$ time per update if done naively)
Stochastic Primal Coordinate Descent: $O(n)$ time per update
Stochastic Dual Coordinate Maximization: $O(d)$ time per update
Case 1: $n \gg d$: use SDCM or SPGD ($O(d)$ time per update)
Case 2: $d \gg n$: use SDGA or SPCD ($O(n)$ time per update)
Practical Issues with GD Variants
How to initialize?
How to decide convergence?
How to decide step lengths?
How to Initialize?

Initializing close to the global optimum is obviously preferable 🙂
Easier said than done. In some applications however, we may have such an initialization e.g. someone may have a model they trained on different data
For convex functions, bad initialization may mean slow convergence, but if step lengths are nice then GD should converge eventually
For non-convex functions (e.g. while training deepnets), bad initialization may mean getting stuck at a very bad saddle point
Random restarts are the most common solution to overcome this problem (see the sketch below)
For some nice non-convex problems, we do know very good ways to provably initialize close to the global optimum (e.g. collaborative filtering in recommendation systems) – details beyond scope of CS771
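
A hedged sketch of the random-restarts idea mentioned above (gd and f are placeholders standing in for any descent routine and objective from this lecture; nothing here is prescribed by the slides):

```python
import numpy as np

def random_restarts(f, gd, d, restarts=10, scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    best_x, best_val = None, np.inf
    for _ in range(restarts):
        x0 = scale * rng.standard_normal(d)  # a fresh random initialization
        x = gd(f, x0)                        # run the chosen GD variant
        if f(x) < best_val:
            best_x, best_val = x, f(x)
    return best_x
```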
How to decide Convergence?

In optimization, convergence can refer to a couple of things
The algorithm has gotten within a "small" distance of a global/local optimum
The algorithm is not making "much" progress e.g. successive iterates are barely changing
GD stops making progress when it reaches a stationary point i.e. it can stop making progress even without having reached a global optimum (e.g. if it has reached a saddle point)
Usually a few heuristics are used to decide when to stop executing GD
If gradient vectors have become too "small", or "not much" progress is being made, or if the objective function value is already acceptably "small", or if the assignment submission deadline is 5 minutes away 🙂
Acceptable levels e.g. "small", "not much" are usually decided either by consulting domain experts or else by using performance on validation sets
How to detect convergence

Method 1: Tolerance technique
For a pre-decided tolerance value $\epsilon$, if $\|\mathbf x^{t+1} - \mathbf x^t\|_2 \le \epsilon$, stop
Method 2: Zero-th order technique
If fn value has not changed much, stop (or else tune the learning rate)!
$|f(\mathbf x^{t+1}) - f(\mathbf x^t)| \le \epsilon$ or $\frac{|f(\mathbf x^{t+1}) - f(\mathbf x^t)|}{|f(\mathbf x^t)|} \le \epsilon$
Method 3: First order technique
If gradient has become too small i.e. $\|\nabla f(\mathbf x^t)\|_2 \le \epsilon$, stop!
Method 4: Cross validation technique
Test the current model on validation data – if performance is acceptable, stop!
Other techniques e.g. primal-dual techniques are usually infeasible for large-scale ML problems and hence not used to decide convergence
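
A hedged sketch of Methods 1–3 as code (the shared threshold eps and the exact criteria are my choices, not prescribed by the slides):

```python
import numpy as np

def converged(x_new, x_old, f, grad, eps=1e-6):
    if np.linalg.norm(x_new - x_old) <= eps:  # Method 1: tolerance on iterates
        return True
    if abs(f(x_new) - f(x_old)) <= eps:       # Method 2: zero-th order
        return True
    if np.linalg.norm(grad(x_new)) <= eps:    # Method 3: first order
        return True
    return False
```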
How to choose Step Length?

For "nicely behaved" convex functions, we have formulae for step length
Set $\eta_t = \frac{\eta_0}{t}$ or else $\eta_t = \frac{\eta_0}{\sqrt t}$ where $\eta_0 > 0$ is a hyperparameter 🙁
Basic idea is to choose $\eta_t \to 0$ (diminishing) and $\sum_t \eta_t = \infty$ (infinite travel)
Simple, for "nice" convex functions: $\epsilon$-convergence in just $O(1/\epsilon^2)$ steps
Details (e.g. what is "nice") beyond scope of CS771 (see CS77X, X = 3,4,7)
A powerful but expensive technique is the Newton method

$\mathbf x^{t+1} = \mathbf x^t - \eta_t \left( \nabla^2 f(\mathbf x^t) \right)^{-1} \nabla f(\mathbf x^t)$

"Autotunes" the step length so that we may directly use $\eta_t = 1$
Offers extremely rapid convergence for "nice" convex problems: roughly, it offers $\epsilon$-convergence in just $O(\log\log(1/\epsilon))$ steps
However, computation of the Hessian $\nabla^2 f(\mathbf x^t)$ is often expensive
Workaround: approximate the Hessian using a diagonal or a low-rank matrix
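
A hedged sketch of GD with the diminishing step lengths above (eta0 and the choice of the 1/sqrt(t) schedule are illustrative; grad is a placeholder for any gradient oracle):

```python
import numpy as np

def gd_diminishing(grad, x0, eta0=1.0, steps=1000):
    x = x0.copy()
    for t in range(1, steps + 1):
        # eta_t = eta0 / sqrt(t): diminishing, yet sum_t eta_t = infinity
        x = x - (eta0 / np.sqrt(t)) * grad(x)
    return x
```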
How to choose Step Length?

For not so well behaved convex functions and non-convex functions, there exist several heuristics – no guarantee they will always work 🙁
Line-search Techniques: find the best step length at every time step

$\eta_t = \arg\min_{\eta > 0} f\left( \mathbf x^t - \eta \cdot \nabla f(\mathbf x^t) \right)$

E.g. Armijo Rule: start by using a decently large value of $\eta$; if the objective function value does not reduce sufficiently, then reduce $\eta$ and try again
Line search can be expensive as it involves multiple GD steps, fn evaluations
Cheaper "adaptive" techniques exist – these employ several tricks
Use a different step length for each dimension of $\mathbf x$ (e.g. Adagrad) where the scalar $\eta_t$ is replaced with a diagonal matrix $D_t$ i.e. $\mathbf x^{t+1} = \mathbf x^t - D_t \cdot \nabla f(\mathbf x^t)$
Use "momentum" methods (e.g. NAG, Adam) which essentially infuse previous gradients into the current gradient
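
A hedged sketch of the Armijo rule (the constants eta, beta, c are typical textbook choices, not values from the slides):

```python
import numpy as np

def armijo_step(f, grad, x, eta=1.0, beta=0.5, c=1e-4):
    g = grad(x)
    # shrink eta until the objective value reduces "sufficiently"
    while f(x - eta * g) > f(x) - c * eta * (g @ g):
        eta *= beta
    return x - eta * g
```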
