Lecture 13: Dual Formulation of Support Vector Machines — Applied ML

This lecture discusses the dual formulation of Support Vector Machines (SVMs), emphasizing Lagrange duality and its application in optimizing SVMs, particularly in high-dimensional feature spaces. It outlines the process of deriving the dual from the primal optimization problem and highlights the equivalence of the primal and dual formulations. Additionally, it addresses practical considerations for non-separable problems by introducing slack variables to allow for constraint violations.

Lecture 13: Dual Formulation of Support Vector Machines — Applied ML 20/10/24, 6:22 PM

Lecture 13: Dual Formulation of Support Vector Machines

Contents
Lecture 13: Dual Formulation of Support Vector Machines
13.1. Lagrange Duality
13.2. Dual Formulation of SVMs
13.3. Practical Considerations for SVM Duals

In this lecture, we will see a different formulation of the SVM called the dual. This
dual formulation will lead to new types of optimization algorithms with favorable
computational properties in scenarios when the number of features is very large
(and possibly even infinite!).

Before we define the dual of the SVM problem, we need to introduce some
additional concepts from optimization, namely Lagrange duality.

13.1.1. Review: Classification Margins

In the previous lecture, we defined the concept of classification margins. Recall
that the margin $\gamma^{(i)}$ is the distance between the separating hyperplane and the
datapoint $x^{(i)}$.

Large margins are good, since data should be far from the decision boundary.
Maximizing the margin of a linear model amounts to solving the following
optimization problem:

$$
\min_{\theta, \theta_0} \; \frac{1}{2} \|\theta\|^2
\quad \text{subject to} \quad y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \text{ for all } i
$$

We are now going to look at a different way of optimizing this objective. But first,
we need to define Lagrange duality.

https://kuleshov-group.github.io/aml-book/contents/lecture13-svm-dual.html Page 1 of 11

13.1.2. Lagrange Duality in Constrained Optimization

We start by introducing the definition of a constrained optimization problem. We
will look at constrained optimization problems of the form

$$
\min_{\theta \in \mathbb{R}^d} J(\theta)
\quad \text{such that} \quad c_k(\theta) \leq 0 \text{ for } k = 1, 2, \ldots, K
$$

where $J(\theta)$ is the optimization objective and each $c_k(\theta) : \mathbb{R}^d \to \mathbb{R}$ is a constraint.

Our goal is to find a small value of $J(\theta)$ such that every $c_k(\theta)$ is non-positive. Rather
than solving the above problem directly, we can solve the following related optimization
problem, which contains additional penalty terms:

$$
\min_{\theta} L(\theta, \lambda) = J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta)
$$

This new objective includes an additional vector of Lagrange multipliers
$\lambda \in [0, \infty)^K$, which are non-negative weights that we place on the constraint terms.
We call $L(\theta, \lambda)$ the Lagrangian. Observe that:

If $\lambda_k \geq 0$, then we penalize large values of $c_k$.

For large enough $\lambda_k$, no $c_k$ will be positive, which gives a valid solution.

Thus, penalties are another way of enforcing constraints.
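As a small illustration, consider a toy problem that is not part of the lecture: $J(\theta) = \theta^2$ with the single constraint $c(\theta) = 1 - \theta \leq 0$. The Lagrangian $\theta^2 + \lambda (1 - \theta)$ is minimized at $\theta = \lambda / 2$, so the sketch below simply tabulates how a larger multiplier pushes the minimizer toward the feasible region. The function names are made up for this example.

```python
# Toy problem (an assumption for illustration): minimize J(theta) = theta^2
# subject to c(theta) = 1 - theta <= 0, i.e. theta >= 1.
def J(theta):
    return theta ** 2

def c(theta):
    return 1.0 - theta

def lagrangian_minimizer(lam):
    # d/dtheta [theta^2 + lam * (1 - theta)] = 2 * theta - lam = 0
    return lam / 2.0

for lam in [0.0, 1.0, 2.0, 3.0]:
    theta = lagrangian_minimizer(lam)
    print(f"lambda={lam:.1f}  theta={theta:.2f}  c(theta)={c(theta):+.2f}")
```

With $\lambda = 0$ the constraint is ignored and violated; from $\lambda = 2$ onward the minimizer of the Lagrangian satisfies it.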

13.1.2.1. The Primal Lagrange Form

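Consider again our constrained optimization problem:

$$
\min_{\theta \in \mathbb{R}^d} J(\theta)
\quad \text{such that} \quad c_k(\theta) \leq 0 \text{ for } k = 1, 2, \ldots, K
$$

We define its primal Lagrange form to be

$$
\min_{\theta \in \mathbb{R}^d} P(\theta)
= \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} L(\theta, \lambda)
= \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} \left( J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right)
$$

These two forms have the same optimum $\theta^*$! To see why, look at the inner maximization over $\lambda$ for a fixed $\theta$:

$$
P(\theta) = \max_{\lambda \geq 0} L(\theta, \lambda)
= \max_{\lambda \geq 0} \left( J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right)
$$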

Observe that:

If a $c_k$ is violated ($c_k > 0$), then $\max_{\lambda \geq 0} L(\theta, \lambda)$ is $\infty$ as $\lambda_k \to \infty$.

If no $c_k$ is violated and $c_k < 0$, then the optimal $\lambda_k = 0$ (any bigger value makes the inner objective smaller).

If $c_k < 0$ for all $k$, then $\lambda_k = 0$ for all $k$ and $P(\theta) = \max_{\lambda \geq 0} L(\theta, \lambda) = J(\theta)$.

In other words, $P(\theta)$ equals $J(\theta)$ when $\theta$ satisfies the constraints and is infinite otherwise. Thus, $\min_{\theta \in \mathbb{R}^d} P(\theta)$ is the solution to our original optimization problem.

13.1.2.2. The Dual Lagrange Form

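Now consider the following problem over $\lambda \geq 0$:

$$
\max_{\lambda \geq 0} D(\lambda)
= \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} L(\theta, \lambda)
= \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} \left( J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right).
$$

We call this the Lagrange dual of the primal optimization problem $\min_{\theta \in \mathbb{R}^d} P(\theta)$.
We can always construct a dual for the primal.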

13.1.2.3. Lagrange Duality

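Once we have constructed a dual for the primal, the dual is interesting
because we always have:

$$
\max_{\lambda \geq 0} D(\lambda)
= \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} L(\theta, \lambda)
\leq \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} L(\theta, \lambda)
= \min_{\theta \in \mathbb{R}^d} P(\theta)
$$

This inequality is known as weak duality. Moreover, in many cases (for example, convex problems that satisfy mild regularity conditions such as Slater's condition), we have

$$
\max_{\lambda \geq 0} D(\lambda) = \min_{\theta \in \mathbb{R}^d} P(\theta).
$$

Thus, the primal and the dual are equivalent! This property is very important, and we will use it
in the next steps of solving SVMs.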
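To make the inequality concrete, here is a minimal numerical sketch on a toy problem chosen for this illustration (it does not come from the lecture): minimize $\theta^2$ subject to $1 - \theta \leq 0$. The primal optimum is $\theta^* = 1$ with value 1, and the dual function has the closed form $D(\lambda) = \lambda - \lambda^2 / 4$, which the code evaluates on a grid.

```python
import numpy as np

# Toy problem (illustrative assumption): min theta^2  s.t.  1 - theta <= 0.
# Primal optimum: theta* = 1, so the optimal primal value is 1.
primal_opt = 1.0

def dual(lam):
    # D(lam) = min_theta [theta^2 + lam * (1 - theta)], attained at theta = lam / 2
    theta = lam / 2.0
    return theta ** 2 + lam * (1.0 - theta)

lams = np.linspace(0.0, 4.0, 401)
D = dual(lams)

print("largest dual value on the grid:", D.max())
print("attained at lambda =", lams[D.argmax()])
print("primal optimum:", primal_opt)
```

Every dual value stays below the primal optimum, and the largest one matches it, which is what strong duality predicts for this convex problem.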

13.1.3. An Aside: Constrained Regularization

Before we move on to defining the dual form of SVMs, we want to make a brief side
comment on the related topic of constrained regularization. Consider a regularized
supervised learning problem with a penalty term:

$$
\min_{\theta \in \Theta} L(\theta) + \gamma \cdot R(\theta).
$$

We may also enforce an explicit constraint on the complexity of the model:

$$
\min_{\theta \in \Theta} L(\theta)
\quad \text{such that} \quad R(\theta) \leq \gamma'
$$

We will not prove this, but solving this problem is equivalent to solving the
penalized problem for some $\gamma > 0$ that is generally different from $\gamma'$. In other words, we can
regularize by explicitly constraining $R(\theta)$ to be at most a given value, or we can penalize
$R(\theta)$.
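The following sketch checks this equivalence numerically for ridge regression on a tiny made-up dataset. It assumes a squared-error loss $L(\theta)$ and $R(\theta) = \|\theta\|^2$, solves the penalized problem in closed form, and then solves the constrained problem with SciPy's general-purpose SLSQP solver, setting $\gamma'$ to the norm reached by the penalized solution. None of these choices come from the lecture.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny made-up regression dataset (illustration only).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

def loss(theta):
    return np.sum((X @ theta - y) ** 2)

# Penalized form: min loss(theta) + gamma * ||theta||^2 (closed-form solution).
gamma = 1.0
theta_pen = np.linalg.solve(X.T @ X + gamma * np.eye(2), X.T @ y)

# Constrained form: min loss(theta)  s.t.  ||theta||^2 <= gamma_prime,
# where gamma_prime is the squared norm of the penalized solution.
gamma_prime = theta_pen @ theta_pen
res = minimize(
    loss,
    x0=np.zeros(2),
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": lambda t: gamma_prime - t @ t}],
)
theta_con = res.x

print("penalized solution:  ", theta_pen)
print("constrained solution:", theta_con)
print("gamma =", gamma, " gamma_prime =", gamma_prime)
```

The two solutions should agree up to solver tolerance, while $\gamma$ and $\gamma'$ take different values.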

Let’s now apply Lagrange duality to support vector machines.

13.2.1. Review: Max-Margin Classification

First, let’s briefly reintroduce the task of binary classification using a linear model
and a max-margin objective.


Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. We
distinguish between two types of supervised learning problems depending on the
targets $y^{(i)}$.

1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.

2. Binary Classification: The target variable $y$ is discrete and takes on one of
$K = 2$ possible values.

In this lecture, we assume $\mathcal{Y} = \{-1, +1\}$. We will also work with linear models of
the form:

$$
f_\theta(x) = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + \ldots + \theta_d \cdot x_d
$$

where $x \in \mathbb{R}^d$ is a vector of features and $y \in \{-1, 1\}$ is the target. The $\theta_j$ are the
parameters of the model.

We can represent the model in a vectorized form

$$
f_\theta(x) = \theta^\top x + \theta_0.
$$

We define the geometric margin $\gamma^{(i)}$ with respect to a training example $(x^{(i)}, y^{(i)})$
as

$$
\gamma^{(i)} = y^{(i)} \left( \frac{\theta^\top x^{(i)} + \theta_0}{\|\theta\|} \right).
$$

This also corresponds to the distance from $x^{(i)}$ to the hyperplane.
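As a quick numerical check of this definition, the sketch below computes the geometric margins of four hypothetical points under a hand-picked hyperplane. Both the data and the parameter values are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data and parameters, chosen only for illustration.
X = np.array([[0.0, 0.0], [-1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])
theta = np.array([1.0, 0.0])
theta_0 = -1.0

# gamma^(i) = y^(i) * (theta^T x^(i) + theta_0) / ||theta||
margins = y * (X @ theta + theta_0) / np.linalg.norm(theta)
print(margins)
```

All margins are positive, so this hyperplane classifies every point correctly; the smallest margin is the one the max-margin objective tries to increase.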

We saw that maximizing the margin of a linear model amounts to solving the
following optimization problem.

$$
\min_{\theta, \theta_0} \; \frac{1}{2} \|\theta\|^2
\quad \text{subject to} \quad y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \text{ for all } i
$$

13.2.2. The Dual of the SVM Problem

Let’s now derive the SVM dual. Consider the following objective, the Lagrangian of
the max-margin optimization problem.

$$
L(\theta, \theta_0, \lambda) = \frac{1}{2} \|\theta\|^2 + \sum_{i=1}^n \lambda_i \left( 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \right)
$$

We have put each constraint inside the objective function and added a penalty $\lambda_i$
to it.

Recall that the following formula is the Lagrange dual of the primal optimization
problem $\min_{\theta \in \mathbb{R}^d} P(\theta)$. We can always construct a dual for the primal.

$$
\max_{\lambda \geq 0} D(\lambda)
= \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} L(\theta, \lambda)
= \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} \left( J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right).
$$

It is easy to write out the dual form of the max-margin problem. Consider
optimizing the above Lagrangian over $\theta, \theta_0$ for any value of $\lambda$.

$$
\min_{\theta, \theta_0} L(\theta, \theta_0, \lambda)
= \min_{\theta, \theta_0} \left( \frac{1}{2} \|\theta\|^2 + \sum_{i=1}^n \lambda_i \left( 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \right) \right)
$$

This objective is a convex quadratic in $\theta$; hence it has a single minimum in $\theta$. It is linear in $\theta_0$, so the minimum is finite only when the coefficient of $\theta_0$ vanishes.

We can find the minimum by setting the derivatives with respect to $\theta$ and $\theta_0$ to zero:

$$
\theta = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}
\qquad \text{and} \qquad
0 = \sum_{i=1}^n \lambda_i y^{(i)}
$$

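Substituting this into the Lagrangian we obtain the following expression for the
dual $\max_{\lambda \geq 0} D(\lambda) = \max_{\lambda \geq 0} \min_{\theta, \theta_0} L(\theta, \theta_0, \lambda)$:

$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$

$$
\text{subject to} \quad \sum_{i=1}^n \lambda_i y^{(i)} = 0 \quad \text{and} \quad \lambda_i \geq 0 \text{ for all } i
$$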
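The substitution step can be spelled out as follows. This is a short derivation sketch that uses only the two stationarity conditions, $\theta = \sum_i \lambda_i y^{(i)} x^{(i)}$ and $\sum_i \lambda_i y^{(i)} = 0$. Expanding the Lagrangian gives

$$
L(\theta, \theta_0, \lambda)
= \frac{1}{2} \theta^\top \theta + \sum_{i=1}^n \lambda_i
- \theta^\top \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}
- \theta_0 \sum_{i=1}^n \lambda_i y^{(i)}.
$$

The last term vanishes because $\sum_i \lambda_i y^{(i)} = 0$, and the third term equals $\theta^\top \theta = \|\theta\|^2$ because $\theta = \sum_i \lambda_i y^{(i)} x^{(i)}$. Hence

$$
D(\lambda)
= \sum_{i=1}^n \lambda_i - \frac{1}{2} \|\theta\|^2
= \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}.
$$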

13.2.3. Properties of SVM Duals

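Recall that in general, we have:

$$
\max_{\lambda \geq 0} D(\lambda)
= \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} L(\theta, \lambda)
\leq \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} L(\theta, \lambda)
= \min_{\theta \in \mathbb{R}^d} P(\theta)
$$

In the case of the SVM problem, one can show that

$$
\max_{\lambda \geq 0} D(\lambda) = \min_{\theta \in \mathbb{R}^d} P(\theta).
$$

Thus, the primal and the dual are equivalent!

We can also make several other observations about this dual:

$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$

$$
\text{subject to} \quad \sum_{i=1}^n \lambda_i y^{(i)} = 0 \quad \text{and} \quad \lambda_i \geq 0 \text{ for all } i
$$

This is a constrained quadratic optimization problem.

The number of variables $\lambda_i$ equals $n$, the number of data points.

The objective only depends on products $(x^{(i)})^\top x^{(k)}$ (keep reading for more on
this!)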
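The last observation can be made concrete with a short sketch: the dual objective below takes only the Gram matrix of inner products, never the features themselves. The data, the multipliers, and the helper name `dual_objective` are assumptions made for this illustration.

```python
import numpy as np

def dual_objective(lam, G, y):
    # sum_i lam_i - 1/2 * sum_{i,k} lam_i lam_k y_i y_k G_ik,
    # where G_ik = (x^(i))^T x^(k) is the Gram matrix.
    v = lam * y
    return lam.sum() - 0.5 * v @ G @ v

# Hypothetical toy data and a feasible lambda (sum_i lam_i y_i = 0).
X = np.array([[0.0, 0.0], [-1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
lam = np.array([0.5, 0.0, 0.5, 0.0])

G = X @ X.T
print("dual objective:", dual_objective(lam, G, y))

# Rotating every point leaves all inner products, and so the objective, unchanged.
angle = 0.7
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
G_rot = (X @ R.T) @ (X @ R.T).T
print("same after rotation:", np.isclose(dual_objective(lam, G_rot, y),
                                         dual_objective(lam, G, y)))
```

Because only `G` enters the computation, any procedure that can produce inner products is enough to evaluate the dual.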

13.2.3.1. When to Solve the Dual


An interesting question arises when we need to decide which optimization problem
to solve: the dual or the primal. In short, the deciding factor is the number of
features (the dimensionality of $x$) relative to the number of datapoints:

The number of variables in the primal equals the number of features (plus one for $\theta_0$),
while the dual has one variable per datapoint. If we
have a few features and many datapoints, we should use the primal.

Conversely, if we have a lot of features, but fewer datapoints, we want to use
the dual.

In the next lecture, we will see how we can use this property to solve machine
learning problems with a very large number of features (even possibly infinite!).

In this part, we will continue our discussion of the dual formulation of the SVM with
additional practical details.

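Recall that the max-margin hyperplane can be formulated as the solution to the
following primal optimization problem.

$$
\min_{\theta, \theta_0} \; \frac{1}{2} \|\theta\|^2
\quad \text{subject to} \quad y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \text{ for all } i
$$

The solution to this problem can also be obtained from the following dual
problem:

$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$

$$
\text{subject to} \quad \sum_{i=1}^n \lambda_i y^{(i)} = 0 \quad \text{and} \quad \lambda_i \geq 0 \text{ for all } i
$$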

13.3.1. Non-Separable Problems

Our dual problem assumes that a separating hyperplane exists. If it doesn’t, our
optimization problem does not have a solution, and we need to modify it. Our
approach is going to be to make each constraint “soft”, by introducing “slack”
variables, which allow the constraint to be violated.

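$$
y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 - \xi_i.
$$

In the optimization problem, we assign a penalty $C$ to these slack variables to
obtain:

$$
\min_{\theta, \theta_0, \xi} \; \frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \xi_i
$$

$$
\text{subject to} \quad y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0 \text{ for all } i
$$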

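This is the primal problem. Let’s now form its dual. First, the Lagrangian
$L(\lambda, \mu, \theta, \theta_0, \xi)$ equals

$$
\frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \xi_i
- \sum_{i=1}^n \lambda_i \left( y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) - 1 + \xi_i \right)
- \sum_{i=1}^n \mu_i \xi_i,
$$

where $\lambda_i \geq 0$ are the multipliers of the margin constraints and $\mu_i \geq 0$ are the multipliers of the constraints $\xi_i \geq 0$.

The dual objective of this problem will equal

$$
D(\lambda, \mu) = \min_{\theta, \theta_0, \xi} L(\lambda, \mu, \theta, \theta_0, \xi).
$$

As earlier, we can solve for the optimal $\theta, \theta_0$ in closed form and plug back the
resulting values into the objective. Setting the derivative with respect to $\xi_i$ to zero
additionally gives $C - \lambda_i - \mu_i = 0$; since $\mu_i \geq 0$, this implies $\lambda_i \leq C$.
We can then show that the dual takes the following form:

$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$

$$
\text{subject to} \quad \sum_{i=1}^n \lambda_i y^{(i)} = 0 \quad \text{and} \quad C \geq \lambda_i \geq 0 \text{ for all } i
$$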
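The box constraint $0 \leq \lambda_i \leq C$ can be observed in practice. The sketch below assumes scikit-learn's `SVC` as the dual solver and uses two overlapping Iris classes restricted to the sepal features, a setup chosen only for this illustration. For a binary problem, `dual_coef_` stores $y^{(i)} \lambda_i$ for the support vectors, so its absolute values are the multipliers.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
# Versicolor (1) vs. Virginica (2) on the two sepal features:
# these classes overlap, so the problem is not linearly separable.
mask = iris.target != 0
X = iris.data[mask][:, :2]
y = np.where(iris.target[mask] == 1, -1, 1)

C = 1.0
clf = svm.SVC(kernel="linear", C=C).fit(X, y)

# For a binary SVC, dual_coef_ holds y^(i) * lambda_i for the support vectors.
lam = np.abs(clf.dual_coef_.ravel())
print("smallest lambda:", lam.min())
print("largest lambda: ", lam.max())
print("multipliers at the upper bound C:", int(np.sum(np.isclose(lam, C))))
```

Points whose multiplier sits at the upper bound $C$ are the ones that use their slack, that is, they lie inside the margin or on the wrong side of the boundary.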

13.3.2. Sequential Minimal Optimization and Coordinate Descent

Coordinate descent is a general way to optimize functions $f(x)$ of multiple
variables $x \in \mathbb{R}^d$. It executes as:

1. Choose a dimension $j \in \{1, 2, \ldots, d\}$.

2. Optimize $f(x_1, x_2, \ldots, x_j, \ldots, x_d)$ over $x_j$ while keeping the other
variables fixed.

These two steps are repeated until convergence.
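The two steps can be sketched for a convex quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$, where the one-dimensional minimization has a closed form. The matrix `A`, the vector `b`, and the cyclic choice of coordinates are assumptions made for this example; other selection rules are possible.

```python
import numpy as np

# f(x) = 1/2 x^T A x - b^T x with A symmetric positive definite (toy values).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def coordinate_descent(A, b, n_sweeps=50):
    x = np.zeros_like(b)
    for _ in range(n_sweeps):
        for j in range(len(b)):  # step 1: pick a coordinate (here: cyclically)
            # Step 2: exact minimization over x_j with the others fixed.
            # Setting df/dx_j = A[j] @ x - b[j] to zero and solving for x_j:
            x[j] = (b[j] - A[j] @ x + A[j, j] * x[j]) / A[j, j]
    return x

x_cd = coordinate_descent(A, b)
print("coordinate descent:", x_cd)
print("direct solve:      ", np.linalg.solve(A, b))
```

For this quadratic the minimizer solves $A x = b$, so the direct solve gives a reference value to compare against.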

[Figure: coordinate descent applied to a 2D quadratic function. The red line shows the trajectory of coordinate descent; each "step" in the trajectory is an iteration of the algorithm. Image from Wikipedia.]

We can apply a form of coordinate descent to solve the dual:


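$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$

$$
\text{subject to} \quad \sum_{i=1}^n \lambda_i y^{(i)} = 0 \quad \text{and} \quad C \geq \lambda_i \geq 0 \text{ for all } i
$$

A popular, efficient algorithm is Sequential Minimal Optimization (SMO), which
executes as:

Take a pair $\lambda_i, \lambda_j$, possibly using heuristics to guide the choice of $i, j$.

Reoptimize over $\lambda_i, \lambda_j$ while keeping the other variables fixed.

Repeat the above until convergence.

SMO updates two variables at a time because the constraint $\sum_{i=1}^n \lambda_i y^{(i)} = 0$ ties the variables together: changing a single $\lambda_i$ on its own would violate it.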

13.3.3. Obtaining a Primal Solution from the Dual

Next, assuming we can solve the dual, how do we find a separating hyperplane
$\theta, \theta_0$?

Recall that we already found an expression for the optimal $\theta^*$ (in the separable
case) as a function of $\lambda$:

$$
\theta^* = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}.
$$

Once we know $\theta^*$, it is easy to check that the solution for $\theta_0$ is given by

$$
\theta_0^* = - \frac{\max_{i : y^{(i)} = -1} (\theta^*)^\top x^{(i)} + \min_{i : y^{(i)} = 1} (\theta^*)^\top x^{(i)}}{2}.
$$
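The sketch below walks through this recovery on a four-point toy dataset invented for this example. It solves the separable dual with SciPy's general-purpose SLSQP solver (a stand-in chosen for brevity, not the SMO algorithm described earlier) and then applies the two formulas for $\theta^*$ and $\theta_0^*$.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical separable toy data; the max-margin boundary is the line x_1 = 1.
X = np.array([[0.0, 0.0], [-1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n = len(y)

# Q_ik = y_i y_k (x_i)^T x_k
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(lam):
    # Negative of the dual objective, since scipy minimizes.
    return 0.5 * lam @ Q @ lam - lam.sum()

def neg_dual_grad(lam):
    return Q @ lam - np.ones(n)

res = minimize(
    neg_dual, x0=np.zeros(n), jac=neg_dual_grad, method="SLSQP",
    bounds=[(0.0, None)] * n,                                  # lambda_i >= 0
    constraints=[{"type": "eq", "fun": lambda lam: lam @ y}],  # sum_i lambda_i y_i = 0
)
lam = res.x

# Recover the primal solution from the dual variables.
theta = (lam * y) @ X
scores = X @ theta
theta_0 = -(scores[y == -1].max() + scores[y == 1].min()) / 2.0

print("lambda :", lam.round(3))
print("theta  :", theta.round(3))
print("theta_0:", round(float(theta_0), 3))
```

For this dataset the recovered hyperplane should be close to $\theta = (1, 0)$, $\theta_0 = -1$, up to solver tolerance.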

13.3.4. Support Vectors

A powerful property of the SVM dual is that at the optimum, most variables $\lambda_i$ are
zero! Thus, $\theta^*$ is a sum of a small number of points:

$$
\theta^* = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}.
$$

The points for which $\lambda_i > 0$ are precisely the points that lie on the margin (are
closest to the hyperplane).

These are called support vectors, and this is where the SVM algorithm takes its
name. We are going to illustrate the concept of an SVM using a figure.

13.3.5. A Hands-On Example


Let’s look at a concrete example of how to use the dual version of the SVM. In this
example, we are going to again use the Iris flower dataset. We will merge two of
the three classes to make it suitable for binary classification.

import numpy as np
import pandas as pd
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris(as_frame=True)
iris_X, iris_y = iris.data, iris.target

# subsample to a quarter of the data points (every 4th row)
iris_X = iris_X.loc[::4]
iris_y = iris_y.loc[::4]

# create a binary classification dataset with labels +/- 1:
# Setosa (class 0) becomes -1; the other two classes are merged into +1
iris_y2 = iris_y.copy()
iris_y2[iris_y2==2] = 1
iris_y2[iris_y2==0] = -1

# print part of the dataset
pd.concat([iris_X, iris_y2], axis=1).head()

    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                 5.1               3.5                1.4               0.2      -1
4                 5.0               3.6                1.4               0.2      -1
8                 4.4               2.9                1.4               0.2      -1
12                4.8               3.0                1.4               0.1      -1
16                5.4               3.9                1.3               0.4      -1

Let’s visualize this dataset.

# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]
import warnings
warnings.filterwarnings("ignore")

# create a 2d version of the dataset from the first two features
# (sepal length and sepal width)
X = iris_X.to_numpy()[:,:2]
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))

# Plot the training points
p1 = plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=60, cmap=plt.cm.Paired)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend(handles=p1.legend_elements()[0], labels=['Setosa', 'Not Setosa'], loc='lower right')

We can run the dual version of the SVM by importing an implementation from
sklearn:

# https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html
from sklearn import svm

# fit the model; a large C means very weak regularization (for illustration purposes)
clf = svm.SVC(kernel='linear', C=1000) # this optimizes the dual
# clf = svm.LinearSVC() # this optimizes for the primal
clf.fit(X, iris_y2)

plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=30, cmap=plt.cm.Paired)

# evaluate the decision function on the grid
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# plot decision boundary and margins
plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
            linestyles=['--', '-', '--'])

# circle the support vectors
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,
            linewidth=1, facecolors='none', edgecolors='k')
plt.xlim([4.6, 6])
plt.ylim([2.25, 4])
plt.show()

We can see that the solid line defines the decision boundary, and the two dashed
lines mark the margin.

The data points that fall on the margin are the support vectors. Notice that only
these vectors determine the position of the hyperplane. If we “wiggle” any of the
other points slightly, the margin remains unchanged, and therefore the max-margin
hyperplane also remains unchanged. However, moving the support vectors
changes both the optimal margin and the optimal hyperplane.

This observation provides an intuitive explanation for the formula

$$
\theta^* = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}.
$$

In this formula, $\lambda_i > 0$ only for the $x^{(i)}$ that are support vectors. Hence, only these
$x^{(i)}$ influence the position of the hyperplane, which matches our earlier intuition.
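This formula can be checked against a fitted scikit-learn model. The sketch below rebuilds the same subsampled, two-feature Iris problem as in the example above and reconstructs $\theta^*$ from the dual variables; for a binary `SVC`, the `dual_coef_` attribute stores $y^{(i)} \lambda_i$ for the support vectors only.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
X = iris.data[::4, :2]                      # every 4th row, sepal length and width
y = np.where(iris.target[::4] == 0, -1, 1)  # Setosa = -1, other classes = +1

clf = svm.SVC(kernel="linear", C=1000).fit(X, y)

# theta* = sum_i lambda_i y^(i) x^(i), summing over the support vectors only
# (all other points have lambda_i = 0 and contribute nothing).
theta_from_dual = clf.dual_coef_ @ clf.support_vectors_

print("support vector indices:", clf.support_)
print("theta from dual:", theta_from_dual.ravel())
print("clf.coef_:      ", clf.coef_.ravel())
```

The two printed vectors should coincide, and the number of support vectors is small compared with the number of training points.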

13.3.6. Algorithm: Support Vector Machine Classification (Dual Form)

In summary, the SVM algorithm can be succinctly defined by the following key
components.

Type: Supervised learning (binary classification)

Model family: Linear decision boundaries.
Objective function: Dual of SVM optimization problem.
Optimizer: Sequential minimal optimization.
Probabilistic interpretation: No simple interpretation!

In the next lecture, we will combine dual SVMs with a new idea called kernels,
which enable them to handle a very large number of features (and even an infinite
number of features) without any additional computational cost.

By Cornell University
© Copyright 2023.
