Optimization for ML
Constrained Optimization
Method 3: Creating a Dual Problem
Suppose we wish to solve
$$\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x}) \quad \text{s.t.} \quad g_c(\mathbf{x}) \le 0 \text{ for all } c = 1, \dots, C$$
(Let us see how to handle multiple constraints; equality constraints can be handled similarly.)
Trick: sneak these constraints into the objective. Construct a barrier (indicator) fn $h$ so that $h(\mathbf{x}) = 0$ if all constraints are satisfied and $h(\mathbf{x}) = \infty$ otherwise, and simply solve
$$\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x}) + h(\mathbf{x})$$
Easy to see that both problems have the same solution.
One very elegant way to construct such a barrier is the following
$$\min_{\mathbf{x} \in \mathbb{R}^d} \; \max_{\substack{\boldsymbol{\alpha} \in \mathbb{R}^C \\ \alpha_c \ge 0}} \; \left\{ f(\mathbf{x}) + \sum_{c=1}^{C} \alpha_c \cdot g_c(\mathbf{x}) \right\}$$
(Hmm … we still have a constraint here, but a very simple one, i.e. $\alpha_c \ge 0$.)
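To see why the inner maximization implements the barrier $h$ from above, note that for any fixed $\mathbf{x}$:
$$\max_{\boldsymbol{\alpha} \ge 0} \; \sum_{c=1}^{C} \alpha_c\, g_c(\mathbf{x}) = \begin{cases} 0 & \text{if } g_c(\mathbf{x}) \le 0 \text{ for all } c \quad (\text{best to set every } \alpha_c = 0) \\ +\infty & \text{if } g_c(\mathbf{x}) > 0 \text{ for some } c \quad (\text{send that } \alpha_c \to \infty) \end{cases}$$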
The Dual Problem
The original optimization problem is also called the primal problem
Recall: variables of the original problem (here $\mathbf{x}$) are called primal variables; the newly introduced variables $\boldsymbol{\alpha}$ are called dual variables
Using the Lagrangian $\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}) = f(\mathbf{x}) + \sum_{c=1}^{C} \alpha_c\, g_c(\mathbf{x})$, we rewrote the primal problem as $\min_{\mathbf{x}} \max_{\boldsymbol{\alpha} \ge 0} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})$
In some cases, the dual problem is easier to solve than the primal
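The dual problem is obtained by swapping the order of min and max (standard definitions; by weak duality, the dual optimal value never exceeds the primal optimal value):
$$\text{Primal:} \;\; \min_{\mathbf{x}} \max_{\boldsymbol{\alpha} \ge 0} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}) \qquad\qquad \text{Dual:} \;\; \max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})$$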
Duality
Let $\hat{\mathbf{x}}$ and $\hat{\boldsymbol{\alpha}}$ be the solutions to the primal and dual problems respectively, i.e. $\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \max_{\boldsymbol{\alpha} \ge 0} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})$ and $\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})$
For the SVM, the dual is actually the problem several solvers (e.g. libsvm, sklearn) solve
Support Vectors
Recall that for the CSVM, the primal and dual solutions satisfy $\hat{\mathbf{w}} = \sum_{i=1}^{n} \hat{\alpha}_i y_i \mathbf{x}_i$, s.t. $0 \le \hat{\alpha}_i \le C$ for all $i$, and $\sum_{i=1}^{n} \hat{\alpha}_i y_i = 0$. Data points with $\hat{\alpha}_i > 0$ are called support vectors.
Reason for the name “SVM”: imagine that each data point $\mathbf{x}_i$ is applying a force $F_i$ on the hyperplane in the direction $\hat{\alpha}_i y_i \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|_2}$ (perpendicular to the hyperplane)
Then the total force on the hyperplane is equal to zero, since $\sum_i \hat{\alpha}_i y_i = 0$
Also, the condition $\hat{\mathbf{w}} = \sum_i \hat{\alpha}_i y_i \mathbf{x}_i$ can be interpreted to mean that the total torque on the hyperplane is zero as well
Thus, support vectors mechanically support the hyperplane (don’t let it shift or rotate around), hence their name
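A quick sanity check of the two equilibrium conditions in this classical mechanical analogy (our notation, with $\times$ the cross product and $F_i = \hat{\alpha}_i y_i \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|_2}$ applied at the point $\mathbf{x}_i$):
$$\sum_i F_i = \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|_2} \sum_i \hat{\alpha}_i y_i = 0, \qquad \sum_i \mathbf{x}_i \times F_i = \left( \sum_i \hat{\alpha}_i y_i \mathbf{x}_i \right) \times \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|_2} = \hat{\mathbf{w}} \times \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|_2} = 0$$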
CSVM Dual Problem
If we have a bias $b$, then the dual problem looks like
$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \left\| \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \right\|_2^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C \text{ for all } i, \text{ and } \sum_{i=1}^{n} \alpha_i y_i = 0$$
The constraint $\sum_{i} \alpha_i y_i = 0$ links all the $\alpha_i$ together: we cannot update a single $\alpha_i$ without disturbing all the others
A more involved algorithm, Sequential Minimal Optimization (SMO) by John Platt, is needed to solve the version with a bias – it updates two $\alpha_i$ at a time!
However, if we omit the bias (hide it inside the model vector as an extra feature), the dual is
$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \left\| \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \right\|_2^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C \text{ for all } i$$
We will see a method to solve this simpler version of the problem
Solvers for the SVM Problem
We can solve the SVM (no bias) either by solving the primal version
$$\min_{\mathbf{w} \in \mathbb{R}^d} \; \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^{n} \left[1 - y_i \langle \mathbf{w}, \mathbf{x}_i \rangle\right]_+$$
(sub-gradient methods apply here, since the primal objective is convex but non-differentiable)
… or the dual version
$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \left\| \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \right\|_2^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C \text{ for all } i$$
(projected methods apply here, since we have a constraint – albeit a simple one – in the dual)
We may use gradient, coordinate, etc. methods to solve either
For the primal, we may use sub-gradient descent, coordinate descent, etc.
For the dual, we may use (projected) gradient ascent, coordinate ascent, etc. (a sketch of the projected route follows below)
Does this mean I need to choose one data point at each time step? Indeed: coordinate ascent in the dual looks a lot like stochastic gradient descent in the primal! Both work with a single data point at a time
We will actually see how to do coordinate maximization for the dual. Since the optimization variable in the dual is $\boldsymbol{\alpha} \in \mathbb{R}^n$, we will need to take one coordinate at a time, i.e. choose a different $\alpha_i$ at each time step
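To make the dual route concrete, here is a minimal NumPy sketch of projected gradient ascent on the no-bias dual; the projection onto the box $[0, C]^n$ is just a clip. (All names, the fixed step length, and iteration count are our own illustrative choices; labels are assumed to be $\pm 1$.)

```python
import numpy as np

def svm_dual_pga(X, y, C=1.0, eta=0.01, iters=1000):
    """Projected gradient ascent on the no-bias CSVM dual (illustrative sketch)."""
    n, d = X.shape
    alpha = np.zeros(n)
    for _ in range(iters):
        w = X.T @ (alpha * y)                         # w = sum_i alpha_i y_i x_i
        grad = 1.0 - y * (X @ w)                      # df/d(alpha_i) = 1 - y_i <w, x_i>
        alpha = np.clip(alpha + eta * grad, 0.0, C)   # ascent step, then project onto [0, C]^n
    return X.T @ (alpha * y)                          # recover the primal model vector w
```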
SDCM for the CSVM Problem
(Warning: in general, finding an unconstrained solution and doing a projection step does not give a true solution – we will see below why it is valid here.)
We wish to solve
$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \left\| \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \right\|_2^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C \text{ for all } i$$
Concentrating on just the terms that involve $\alpha_i$ (write $\mathbf{w}_{-i} = \sum_{j \neq i} \alpha_j y_j \mathbf{x}_j$), we get
$$\max_{\alpha_i} \; \alpha_i \left( 1 - y_i \langle \mathbf{w}_{-i}, \mathbf{x}_i \rangle \right) - \frac{\alpha_i^2}{2} \|\mathbf{x}_i\|_2^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C$$
Renaming $p = \|\mathbf{x}_i\|_2^2$ and $q = 1 - y_i \langle \mathbf{w}_{-i}, \mathbf{x}_i \rangle$, we get
$$\max_{\alpha_i} \; q\,\alpha_i - \frac{p}{2}\,\alpha_i^2 \quad \text{s.t.} \quad 0 \le \alpha_i \le C$$
Solution is very simple: find the unrestricted maximum, i.e. $\tilde{\alpha}_i = q/p$, then clip it to the box: if $\tilde{\alpha}_i < 0$, the solution is $0$; elif $\tilde{\alpha}_i > C$, the solution is $C$; else it is $\tilde{\alpha}_i$
In this special case, our objective has a nice property called unimodality (it is a concave quadratic in the single variable $\alpha_i$), which is why this clipping trick works – it won’t work in general
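In other words, for a unimodal (here, concave quadratic) one-dimensional objective with $p = \|\mathbf{x}_i\|_2^2 > 0$, projecting the unrestricted maximizer onto the feasible interval recovers the exact constrained solution:
$$\arg\max_{\alpha_i \in [0, C]} \left( q\,\alpha_i - \frac{p}{2}\,\alpha_i^2 \right) = \min\left\{ \max\left\{ \frac{q}{p},\, 0 \right\},\, C \right\} = \Pi_{[0,C]}\!\left( \frac{q}{p} \right)$$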
Speeding up SDCM Computations
All that is left is to figure out how to compute $p$ and $q$ for our chosen coordinate $i$
$p = \|\mathbf{x}_i\|_2^2$ can be easily precomputed for all data points
However, $q = 1 - y_i \langle \mathbf{w}_{-i}, \mathbf{x}_i \rangle$ needs $O(nd)$ time to compute …
… only if done naively. Recall that we always have $\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i$ for the CSVM (even if we have bias and slack variables)
Thus, $\mathbf{w}_{-i} = \mathbf{w} - \alpha_i y_i \mathbf{x}_i$
If we somehow had access to $\mathbf{w}$, then computing $\mathbf{w}_{-i}$ would take $O(d)$ time and computing $q$ would take $O(d)$ time
All we need to do is create (and update) the vector $\mathbf{w}$ in addition to the $\boldsymbol{\alpha}$ vector, and we would be able to find $q$ in just $O(d)$ time
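Putting the pieces together, here is a minimal NumPy sketch of SDCM with the maintained $\mathbf{w}$ vector, so each update costs $O(d)$. (Function and variable names are our own; labels are assumed to be $\pm 1$ and data points nonzero.)

```python
import numpy as np

def svm_sdcm(X, y, C=1.0, epochs=10, seed=0):
    """SDCM for the no-bias CSVM: O(d) per update (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                      # maintained invariant: w = sum_i alpha_i y_i x_i
    p = np.einsum('ij,ij->i', X, X)      # p_i = ||x_i||^2, precomputed once for all points
    for _ in range(epochs * n):
        i = rng.integers(n)              # pick a random coordinate alpha_i
        # q = 1 - y_i <w_{-i}, x_i>, using w_{-i} = w - alpha_i y_i x_i and y_i^2 = 1
        q = 1.0 - y[i] * (X[i] @ w) + alpha[i] * p[i]
        a_new = np.clip(q / p[i], 0.0, C)             # clipped 1-D maximizer
        w += (a_new - alpha[i]) * y[i] * X[i]         # O(d) update restores the invariant
        alpha[i] = a_new
    return w, alpha
```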
Which Method to Choose?
(Exercise: can you work out the details of how to implement stochastic primal coordinate descent in $O(n)$ time per update?)
Be careful not to get confused with similar-sounding terms. Coordinate ascent takes a small step along one of the coordinates to increase the objective a bit; coordinate maximization instead tries to completely maximize the objective along that coordinate. Also be careful that some books/papers may call a method “coordinate ascent” even when it is really doing coordinate maximization – the terminology is unfortunately a bit non-standard.
Gradient Methods
Primal Gradient Descent: $O(nd)$ time per update
Dual Gradient Ascent: $O(nd)$ time per update
Stochastic Gradient Methods
Stochastic Primal Gradient Descent: $O(d)$ time per update
Stochastic Dual Gradient Ascent: $O(n)$ time per update
Coordinate Methods (take $O(nd)$ time per update if done naively)
Stochastic Primal Coordinate Descent: $O(n)$ time per update
Stochastic Dual Coordinate Maximization: $O(d)$ time per update
Case 1: $n \gg d$: use SDCM or SPGD ($O(d)$ time per update)
Case 2: $d \gg n$: use SDGA or SPCD ($O(n)$ time per update)
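For instance (hypothetical sizes), with $n = 10^6$ data points and $d = 10^3$ features we are in Case 1: one SDCM or SPGD update costs $O(d) \approx 10^3$ operations, versus $O(nd) \approx 10^9$ for a full (sub-)gradient update – a millionfold saving per update, at the price of noisier individual updates.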
Practical Issues with GD Variants
How to initialize?
How to decide convergence?
How to decide step lengths?
How to Initialize?
Initializing close to the global optimum is obviously preferable
Easier said than done. In some applications, however, we may have such an initialization available, e.g. someone may have a model they trained on different data
For convex functions, bad initialization may mean slow convergence,
but if step lengths are nice then GD should converge eventually
For non-convex functions (e.g. while training deepnets), bad
initialization may mean getting stuck at a very bad saddle point
Random restarts are the most common solution to overcome this problem
For some nice non-convex problems, we do know very good ways to
provably initialize close to the global optimum (e.g. collaborative
filtering in recommendation systems) – details beyond scope of CS771
How to decide Convergence?
In optimization, convergence can refer to a couple of things
The algorithm has gotten within a “small” distance of a global/local optimum
The algorithm is not making “much” progress, e.g. successive iterates or objective values are barely changing
GD stops making progress when it reaches a stationary point, i.e. it can stop making progress even without having reached a global optimum (e.g. if it has reached a saddle point)
Usually a few heuristics are used to decide when to stop executing GD
If gradient vectors have become too “small”, or “not much” progress is being made, or if the objective function value is already acceptably “small”, or if the assignment submission deadline is 5 minutes away
Acceptable levels, e.g. “small”, “not much”, are usually decided either by consulting domain experts or else by using performance on validation sets
How to detect Convergence?
Method 1: Tolerance technique
For a pre-decided tolerance value $\epsilon > 0$, if $\|\mathbf{x}^{t+1} - \mathbf{x}^{t}\|_2 < \epsilon$, stop!
Method 2: Zero-th order technique
If fn value has not changed much, stop (or else tune the learning rate)!
$\left| f(\mathbf{x}^{t+1}) - f(\mathbf{x}^{t}) \right| < \epsilon$ or $\frac{\left| f(\mathbf{x}^{t+1}) - f(\mathbf{x}^{t}) \right|}{\left| f(\mathbf{x}^{t}) \right|} < \epsilon$
Method 3: First order technique
If gradient has become too small, i.e. $\|\nabla f(\mathbf{x}^{t})\|_2 < \epsilon$, stop!
Method 4: Cross validation technique
Test the current model on validation data – if performance is acceptable, stop!
Other techniques, e.g. primal-dual techniques, are usually infeasible for large-scale ML problems and hence not used to decide convergence
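A minimal sketch combining Methods 1–3 inside a GD loop (the threshold, step length, and all names are our own illustrative choices):

```python
import numpy as np

def gd_with_stopping(f, grad_f, x0, eta=0.1, eps=1e-6, max_iters=10_000):
    """Gradient descent with the convergence heuristics above (illustrative sketch)."""
    x = x0
    for _ in range(max_iters):
        x_new = x - eta * grad_f(x)
        if np.linalg.norm(x_new - x) < eps:        # Method 1: iterates barely move
            return x_new
        if abs(f(x_new) - f(x)) < eps:             # Method 2: fn value barely changes
            return x_new
        if np.linalg.norm(grad_f(x_new)) < eps:    # Method 3: gradient is nearly zero
            return x_new
        x = x_new
    return x

# Usage on a toy quadratic: converges toward the origin
x_hat = gd_with_stopping(lambda x: x @ x, lambda x: 2 * x, np.ones(5))
```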
How to choose Step Length?
For “nicely behaved” convex functions, we have formulae for the step length
Set $\eta_t = \frac{\eta}{\sqrt{t}}$ or else $\eta_t = \frac{\eta}{t}$, where $\eta > 0$ is a hyperparameter
Basic idea is to choose step lengths satisfying $\eta_t \to 0$ (diminishing) and $\sum_t \eta_t = \infty$ (infinite travel)
Simple, and for “nice” convex functions this gives $\epsilon$-convergence in just $O(1/\epsilon^2)$ steps
Details (e.g. what is “nice”) beyond scope of CS771 (see CS77X, X = 3,4,7)
A powerful but expensive technique is the Newton method, which uses second-order information to choose step lengths (and directions)
Another option is line search, e.g. the Armijo rule: start by using a decently large value of $\eta$; if the objective function value does not reduce sufficiently, then reduce $\eta$ and try again
Line search can be expensive as it involves multiple fn evaluations per GD step
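A minimal sketch of backtracking in the spirit of the Armijo rule above ($\eta_0$, $\beta$, $c$ and all names are illustrative choices of our own):

```python
import numpy as np

def armijo_step(f, grad_f, x, eta0=1.0, beta=0.5, c=1e-4):
    """One GD step with backtracking line search (illustrative sketch)."""
    g = grad_f(x)
    eta = eta0
    # Shrink eta until the objective decreases "sufficiently" (Armijo condition)
    while f(x - eta * g) > f(x) - c * eta * (g @ g):
        eta *= beta
    return x - eta * g
```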
Cheaper “adaptive” techniques exist – these employ several tricks
Use a different step length for each dimension of $\mathbf{x}$ (e.g. Adagrad), where the scalar $\eta$ is replaced with a diagonal matrix $D_t$, i.e. $\mathbf{x}^{t+1} = \mathbf{x}^{t} - D_t\, \nabla f(\mathbf{x}^{t})$
Use “momentum” methods (e.g. NAG, Adam) which essentially infuse previous gradients into the current gradient
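For concreteness, one common form of the momentum update (our notation, with momentum parameter $\beta \in [0, 1)$), which makes the effective search direction an exponentially weighted average of past gradients:
$$\mathbf{v}^{t+1} = \beta\, \mathbf{v}^{t} + \nabla f(\mathbf{x}^{t}), \qquad \mathbf{x}^{t+1} = \mathbf{x}^{t} - \eta\, \mathbf{v}^{t+1}$$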