
Lecture 13: Dual Formulation of Support Vector Machines — Applied ML

This lecture discusses the dual formulation of Support Vector Machines (SVMs), emphasizing Lagrange duality and its application in optimizing SVMs, particularly in high-dimensional feature spaces. It outlines the process of deriving the dual from the primal optimization problem and highlights the equivalence of the primal and dual formulations. Additionally, it addresses practical considerations for non-separable problems by introducing slack variables to allow for constraint violations.

Lecture 13: Dual Formulation of Support Vector Machines — Applied ML 20/10/24, 6:22 PM

Lecture 13: Dual Formulation of Support Vector Machines

Contents
Lecture 13: Dual Formulation of Support Vector Machines
13.1. Lagrange Duality
13.2. Dual Formulation of SVMs
13.3. Practical Considerations for SVM Duals

In this lecture, we will see a different formulation of the SVM called the dual. This
dual formulation will lead to new types of optimization algorithms with favorable
computational properties in scenarios when the number of features is very large
(and possibly even infinite!).

Before we define the dual of the SVM problem, we need to introduce some
additional concepts from optimization, namely Lagrange duality.

13.1.1. Review: Classification Margins


In the previous lecture, we defined the concept of classification margins. Recall
that the margin $\gamma^{(i)}$ is the distance between the separating hyperplane and the
datapoint $x^{(i)}$.

Large margins are good, since data should be far from the decision boundary.
Maximizing the margin of a linear model amounts to solving the following
optimization problem:

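$$
\begin{aligned}
\min_{\theta, \theta_0} \quad & \frac{1}{2} \|\theta\|^2 \\
\text{subject to} \quad & y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \text{ for all } i
\end{aligned}
$$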

We are now going to look at a different way of optimizing this objective. But first,
we need to define Lagrange duality.

https://kuleshov-group.github.io/aml-book/contents/lecture13-svm-dual.html Page 1 of 11

13.1.2. Lagrange Duality in Constrained Optimization

We start by introducing the definition of a constrained optimization problem. We
will look at constrained optimization problems of the form

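$$
\begin{aligned}
\min_{\theta \in \mathbb{R}^d} \quad & J(\theta) \\
\text{such that} \quad & c_k(\theta) \leq 0 \text{ for } k = 1, 2, \ldots, K
\end{aligned}
$$

where $J(\theta)$ is the optimization objective and each $c_k(\theta) : \mathbb{R}^d \to \mathbb{R}$ is a constraint.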

Our goal is to find a small value of $J(\theta)$ such that every $c_k(\theta)$ is
non-positive. Rather than solving the above problem directly, we can solve the
following related optimization problem, which contains additional penalty terms:

$$
\min_{\theta} L(\theta, \lambda) = J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta)
$$

This new objective includes an additional vector of Lagrange multipliers
$\lambda \in [0, \infty)^K$, which are non-negative weights that we place on the
constraint terms. We call $L(\theta, \lambda)$ the Lagrangian. Observe that:

If $\lambda_k \geq 0$, then we penalize large values of $c_k$.
For large enough $\lambda_k$, no $c_k$ will be positive, which gives a valid solution.

Thus, penalties are another way of enforcing constraints.

13.1.2.1. The Primal Lagrange Form


Consider again our constrained optimization problem:

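$$
\begin{aligned}
\min_{\theta \in \mathbb{R}^d} \quad & J(\theta) \\
\text{such that} \quad & c_k(\theta) \leq 0 \text{ for } k = 1, 2, \ldots, K
\end{aligned}
$$

We define its primal Lagrange form to be

$$
\min_{\theta \in \mathbb{R}^d} P(\theta) = \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} L(\theta, \lambda) = \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} \left( J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right)
$$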

These two forms have the same optimum $\theta^*$! To see why, fix $\theta$ and
consider the inner maximization $P(\theta) = \max_{\lambda \geq 0} L(\theta, \lambda)$.

Observe that:

If a $c_k$ is violated ($c_k(\theta) > 0$), then $\max_{\lambda \geq 0} L(\theta, \lambda)$ is $\infty$, reached as $\lambda_k \to \infty$.
If a $c_k$ is strictly satisfied ($c_k(\theta) < 0$), then the optimal $\lambda_k$ is $0$ (any larger value makes the inner objective smaller).
If $c_k(\theta) \leq 0$ for all $k$, then every term $\lambda_k c_k(\theta)$ is zero at the inner optimum, and $P(\theta) = \max_{\lambda \geq 0} L(\theta, \lambda) = J(\theta)$.

Thus, $P(\theta)$ equals $J(\theta)$ on the feasible set and $\infty$ everywhere else, so
$\min_{\theta \in \mathbb{R}^d} P(\theta)$ is the solution to our original optimization problem.
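The behavior of the inner maximization can be checked numerically. The sketch below uses a hypothetical one-dimensional problem that is not part of the lecture ($J(\theta) = \theta^2$ with the single constraint $c(\theta) = 1 - \theta \leq 0$) and caps $\lambda$ at a finite value `lam_max` as a stand-in for $\lambda \to \infty$.

```python
def J(theta):
    # Toy objective (an assumption for illustration)
    return theta ** 2

def c(theta):
    # Toy constraint c(theta) <= 0, i.e. theta >= 1
    return 1.0 - theta

def P_truncated(theta, lam_max):
    """Max of J + lam * c over lam in [0, lam_max].

    The Lagrangian is linear in lam, so the maximum over the
    interval is attained at one of its two endpoints.
    """
    return max(J(theta) + lam * c(theta) for lam in (0.0, lam_max))

# Feasible point: the best multiplier is 0 and P equals J
feasible_value = P_truncated(1.5, lam_max=1e6)

# Infeasible point: P grows without bound as the cap on lam grows
infeasible_small_cap = P_truncated(0.5, lam_max=1e3)
infeasible_large_cap = P_truncated(0.5, lam_max=1e6)

print(feasible_value, infeasible_small_cap, infeasible_large_cap)
```

At the feasible point the value is exactly $J(\theta)$, while at the infeasible point it keeps growing with the cap, which is the finite-precision picture of $P(\theta) = \infty$.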

13.1.2.2. The Dual Lagrange Form


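Now consider the following problem over $\lambda \geq 0$:

$$
\max_{\lambda \geq 0} D(\lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} L(\theta, \lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} \left( J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right).
$$

We call this the Lagrange dual of the primal optimization problem $\min_{\theta \in \mathbb{R}^d} P(\theta)$.
We can always construct a dual for the primal.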

13.1.2.3. Lagrange Duality


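Once we have constructed a dual for the primal, the dual is interesting
because we always have:

$$
\max_{\lambda \geq 0} D(\lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} L(\theta, \lambda) \leq \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} L(\theta, \lambda) = \min_{\theta \in \mathbb{R}^d} P(\theta)
$$

This inequality is known as weak duality. Moreover, in many cases (for example,
convex problems that satisfy mild regularity conditions), we have

$$
\max_{\lambda \geq 0} D(\lambda) = \min_{\theta \in \mathbb{R}^d} P(\theta).
$$

In those cases the primal and the dual have the same optimal value, so solving one
is equivalent to solving the other! This is very important, and we will use this
property in the next steps of solving SVMs.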
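To make the inequality concrete, here is a small numerical sketch on a hypothetical toy problem that is not from the lecture: $J(\theta) = \theta^2$ subject to $c(\theta) = 1 - \theta \leq 0$. Both $\theta$ and $\lambda$ are restricted to finite grids, which is an assumption made purely so that the min and max can be evaluated by brute force.

```python
import numpy as np

def lagrangian(theta, lam):
    # L(theta, lam) = J(theta) + lam * c(theta) for the toy problem
    return theta ** 2 + lam * (1.0 - theta)

thetas = np.linspace(-1.0, 4.0, 501)   # grid over the primal variable
lams = np.linspace(0.0, 5.0, 501)      # grid over lam >= 0 (capped at 5)

# L_grid[a, b] = L(thetas[a], lams[b])
L_grid = lagrangian(thetas[:, None], lams[None, :])

# Dual value: max over lam of (min over theta)
dual_value = L_grid.min(axis=0).max()

# Primal value: min over theta of (max over lam)
primal_value = L_grid.max(axis=1).min()

print(dual_value, primal_value)
```

For this convex toy problem both quantities come out at (approximately) 1, attained at $\theta = 1$ and $\lambda = 2$, so the dual bound is tight here.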

13.1.3. An Aside: Constrained Regularization


Before we move on to defining the dual form of SVMs, we want to make a brief side
comment on the related topic of constrained regularization. Consider a regularized
supervised learning problem with a penalty term (here $L(\theta)$ denotes the
training loss, not the Lagrangian):

$$
\min_{\theta \in \Theta} L(\theta) + \gamma \cdot R(\theta).
$$

We may also enforce an explicit constraint on the complexity of the model:

$$
\begin{aligned}
\min_{\theta \in \Theta} \quad & L(\theta) \\
\text{such that} \quad & R(\theta) \leq \gamma'
\end{aligned}
$$

We will not prove this, but solving this problem is equivalent to solving the
penalized problem for some $\gamma > 0$ that is generally different from $\gamma'$.
In other words, we can regularize either by explicitly constraining $R(\theta)$ to be
at most some value or by penalizing $R(\theta)$.
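The correspondence between a penalty weight and a constraint budget can be illustrated with ridge regression. This is a sketch under assumptions not made in the lecture: a squared-error loss, $R(\theta) = \|\theta\|^2$, and synthetic data. Each penalty $\gamma$ yields a solution whose norm can be read as the implied budget $\gamma'$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, gamma):
    """Closed-form minimizer of ||X theta - y||^2 + gamma * ||theta||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)

gammas = [0.1, 1.0, 10.0, 100.0]

# Implied budget gamma' = R(theta_gamma) for each penalty weight gamma
budgets = [float(np.sum(ridge(X, y, g) ** 2)) for g in gammas]

for g, b in zip(gammas, budgets):
    print(f"gamma = {g:6.1f}  ->  implied budget gamma' = {b:.4f}")
```

Larger penalties correspond to smaller budgets, which is one way to see why $\gamma$ and $\gamma'$ play the same role without being equal.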

Let’s now apply Lagrange duality to support vector machines.

13.2.1. Review: Max-Margin Classification


First, let’s briefly reintroduce the task of binary classification using a linear model
and a max-margin objective.


Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. We
distinguish between two types of supervised learning problems depending on the
targets $y^{(i)}$.

1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
2. Binary Classification: The target variable $y$ is discrete and takes on one of
$K = 2$ possible values.

In this lecture, we assume $\mathcal{Y} = \{-1, +1\}$. We will also work with linear models of
the form:

$$
f_\theta(x) = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + \ldots + \theta_d \cdot x_d
$$

where $x \in \mathbb{R}^d$ is a vector of features and $y \in \{-1, 1\}$ is the target. The $\theta_j$ are the
parameters of the model.

We can represent the model in a vectorized form

$$
f_\theta(x) = \theta^\top x + \theta_0.
$$

We define the geometric margin $\gamma^{(i)}$ with respect to a training example $(x^{(i)}, y^{(i)})$
as

$$
\gamma^{(i)} = y^{(i)} \left( \frac{\theta^\top x^{(i)} + \theta_0}{\|\theta\|} \right).
$$

This also corresponds to the distance from $x^{(i)}$ to the hyperplane.
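As a quick illustration of this definition, the sketch below evaluates the geometric margin for a few hand-picked points. The hyperplane and the data are hypothetical values chosen so that the arithmetic is easy to follow.

```python
import numpy as np

def geometric_margins(X, y, theta, theta0):
    """Signed distance of each point to the hyperplane theta^T x + theta0 = 0."""
    return y * (X @ theta + theta0) / np.linalg.norm(theta)

theta = np.array([3.0, 4.0])   # ||theta|| = 5
theta0 = -5.0

X = np.array([[3.0, 4.0],      # far on the positive side
              [0.0, 0.0],      # on the negative side
              [1.0, 0.0]])     # labeled +1 but on the negative side
y = np.array([1.0, -1.0, 1.0])

margins = geometric_margins(X, y, theta, theta0)
print(margins)
```

A positive margin means the point is on the correct side of the hyperplane; the negative value for the last point signals a misclassification.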

We saw that maximizing the margin of a linear model amounts to solving the
following optimization problem.

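$$
\begin{aligned}
\min_{\theta, \theta_0} \quad & \frac{1}{2} \|\theta\|^2 \\
\text{subject to} \quad & y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \text{ for all } i
\end{aligned}
$$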

13.2.2. The Dual of the SVM Problem


Let’s now derive the SVM dual. Consider the following objective, the Lagrangian of
the max-margin optimization problem.

$$
L(\theta, \theta_0, \lambda) = \frac{1}{2} \|\theta\|^2 + \sum_{i=1}^n \lambda_i \left( 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \right)
$$

We have put each constraint inside the objective function and added a penalty $\lambda_i$
to it.

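Recall that the Lagrange dual of a primal optimization problem
$\min_{\theta \in \mathbb{R}^d} P(\theta)$ is given by the following formula, and that we can
always construct it:

$$
\max_{\lambda \geq 0} D(\lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} L(\theta, \lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} \left( J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right).
$$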

It is easy to write out the dual form of the max-margin problem. Consider
optimizing the above Lagrangian over $\theta, \theta_0$ for any value of $\lambda$.

$$
\min_{\theta, \theta_0} L(\theta, \theta_0, \lambda) = \min_{\theta, \theta_0} \left( \frac{1}{2} \|\theta\|^2 + \sum_{i=1}^n \lambda_i \left( 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \right) \right)
$$

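This objective is quadratic in $\theta$; hence it has a single minimum in $\theta$. It is
linear in $\theta_0$, so the minimum is finite only when the coefficient of $\theta_0$ vanishes.

Setting the derivatives with respect to $\theta$ and $\theta_0$ to zero gives:

$$
\begin{aligned}
\theta &= \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)} \\
0 &= \sum_{i=1}^n \lambda_i y^{(i)}
\end{aligned}
$$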

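Substituting this into the Lagrangian we obtain the following expression for the
dual $\max_{\lambda \geq 0} D(\lambda) = \max_{\lambda \geq 0} \min_{\theta, \theta_0} L(\theta, \theta_0, \lambda)$:

$$
\begin{aligned}
\max_{\lambda} \quad & \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\
\text{subject to} \quad & \sum_{i=1}^n \lambda_i y^{(i)} = 0 \\
& \lambda_i \geq 0 \text{ for all } i
\end{aligned}
$$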
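The substitution step can be spelled out as follows. This is a sketch of the algebra that uses only the two stationarity conditions $\theta = \sum_i \lambda_i y^{(i)} x^{(i)}$ and $\sum_i \lambda_i y^{(i)} = 0$; it first expands the Lagrangian and then plugs both conditions in.

```latex
\begin{aligned}
L(\theta, \theta_0, \lambda)
&= \frac{1}{2} \theta^\top \theta
   - \theta^\top \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}
   - \theta_0 \sum_{i=1}^n \lambda_i y^{(i)}
   + \sum_{i=1}^n \lambda_i \\
&= \frac{1}{2} \|\theta\|^2 - \|\theta\|^2 - \theta_0 \cdot 0 + \sum_{i=1}^n \lambda_i \\
&= \sum_{i=1}^n \lambda_i - \frac{1}{2} \|\theta\|^2 \\
&= \sum_{i=1}^n \lambda_i
   - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n
     \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
\end{aligned}
```

The last line expands $\|\theta\|^2 = \theta^\top \theta$ with $\theta$ written as the weighted sum of training points, which is where the double sum over pairs of points comes from.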

13.2.3. Properties of SVM Duals


Recall that in general, we have:

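$$
\max_{\lambda \geq 0} D(\lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} L(\theta, \lambda) \leq \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} L(\theta, \lambda) = \min_{\theta \in \mathbb{R}^d} P(\theta)
$$

In the case of the SVM problem, one can show that

$$
\max_{\lambda \geq 0} D(\lambda) = \min_{\theta \in \mathbb{R}^d} P(\theta).
$$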

Thus, the primal and the dual are equivalent!

We can also make several other observations about this dual:

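$$
\begin{aligned}
\max_{\lambda} \quad & \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\
\text{subject to} \quad & \sum_{i=1}^n \lambda_i y^{(i)} = 0 \text{ and } \lambda_i \geq 0 \text{ for all } i
\end{aligned}
$$

This is a constrained quadratic optimization problem.
The number of variables $\lambda_i$ equals $n$, the number of data points.
The objective only depends on the inner products $(x^{(i)})^\top x^{(k)}$ (keep reading for more on this!).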
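These observations can be made concrete with a few lines of NumPy. The sketch below evaluates the dual objective on a hypothetical four-point dataset; the function name `dual_objective` and the data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def dual_objective(lam, X, y):
    """sum_i lam_i - 1/2 * sum_{i,k} lam_i lam_k y_i y_k x_i^T x_k."""
    gram = X @ X.T                 # n x n matrix of inner products
    Q = np.outer(y, y) * gram      # Q[i, k] = y_i y_k x_i^T x_k
    return lam.sum() - 0.5 * lam @ Q @ lam

X = np.array([[2.0, 0.0],
              [0.0, 0.0],
              [3.0, 1.0],
              [-1.0, 2.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])

# A feasible choice of multipliers: lam >= 0 and sum_i lam_i y_i = 0
lam = np.array([0.5, 0.5, 0.0, 0.0])

gram_shape = (X @ X.T).shape
value = dual_objective(lam, X, y)
print(gram_shape, value)
```

The data enter only through the $n \times n$ matrix of inner products, so the size of the problem is set by the number of points rather than the number of features.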

13.2.3.1. When to Solve the Dual


An interesting question arises when we need to decide which optimization problem
to solve: the dual or the primal. In short, the deciding factor is the number of
features (the dimensionality of $x$) relative to the number of datapoints:

The number of variables in the primal depends on the number of features. If we
have few features and many datapoints, we should use the primal.
Conversely, the number of variables in the dual equals the number of datapoints.
If we have a lot of features but fewer datapoints, we want to use the dual.

In the next lecture, we will see how we can use this property to solve machine
learning problems with a very large number of features (even possibly infinite!).

In this part, we will continue our discussion of the dual formulation of the SVM with
additional practical details.

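Recall that the max-margin hyperplane can be formulated as the solution to the
following primal optimization problem.

$$
\begin{aligned}
\min_{\theta, \theta_0} \quad & \frac{1}{2} \|\theta\|^2 \\
\text{subject to} \quad & y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \text{ for all } i
\end{aligned}
$$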

The solution to this problem also happens to be given by the following dual
problem:

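$$
\begin{aligned}
\max_{\lambda} \quad & \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\
\text{subject to} \quad & \sum_{i=1}^n \lambda_i y^{(i)} = 0 \\
& \lambda_i \geq 0 \text{ for all } i
\end{aligned}
$$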

13.3.1. Non-Separable Problems


Our dual problem assumes that a separating hyperplane exists. If it doesn’t, our
optimization problem does not have a solution, and we need to modify it. Our
approach is going to be to make each constraint “soft”, by introducing “slack”
variables $\xi_i \geq 0$, which allow the constraint to be violated:

$$
y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 - \xi_i.
$$

In the optimization problem, we assign a penalty $C$ to these slack variables to
obtain:

$$
\begin{aligned}
\min_{\theta, \theta_0, \xi} \quad & \frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \xi_i \\
\text{subject to} \quad & y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 - \xi_i \text{ for all } i \\
& \xi_i \geq 0 \text{ for all } i
\end{aligned}
$$
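For a fixed hyperplane, the smallest slack that satisfies both constraints is $\xi_i = \max(0, 1 - y^{(i)}((x^{(i)})^\top \theta + \theta_0))$. The sketch below computes these slacks and the resulting objective value; the hyperplane, the data, and the value of `C` are hypothetical choices for illustration.

```python
import numpy as np

def soft_margin_objective(theta, theta0, X, y, C):
    """Return the soft-margin objective and the minimal feasible slacks."""
    scores = y * (X @ theta + theta0)
    xi = np.maximum(0.0, 1.0 - scores)   # zero when the margin constraint holds
    objective = 0.5 * theta @ theta + C * xi.sum()
    return objective, xi

X = np.array([[2.0, 0.0],
              [0.0, 0.0],
              [0.5, 0.0]])   # this point falls on the wrong side
y = np.array([1.0, -1.0, 1.0])
theta = np.array([1.0, 0.0])
theta0 = -1.0

objective, xi = soft_margin_objective(theta, theta0, X, y, C=10.0)
print(xi, objective)
```

Only the third point needs slack, and the penalty $C$ controls how much that violation costs relative to the margin term $\frac{1}{2}\|\theta\|^2$.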

This is the primal problem. Let’s now form its dual. First, the Lagrangian
$L(\lambda, \mu, \theta, \theta_0, \xi)$, with multipliers $\lambda_i \geq 0$ for the margin
constraints and $\mu_i \geq 0$ for the constraints $\xi_i \geq 0$, equals

$$
\frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \lambda_i \left( y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) - 1 + \xi_i \right) - \sum_{i=1}^n \mu_i \xi_i.
$$

The dual objective of this problem will equal

$$
D(\lambda, \mu) = \min_{\theta, \theta_0, \xi} L(\lambda, \mu, \theta, \theta_0, \xi).
$$

As earlier, we can set the derivatives with respect to $\theta$, $\theta_0$ (and now also $\xi$)
to zero and plug the resulting values back into the objective. We can then show
that the dual takes the following form:

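$$
\begin{aligned}
\max_{\lambda} \quad & \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\
\text{subject to} \quad & \sum_{i=1}^n \lambda_i y^{(i)} = 0 \\
& C \geq \lambda_i \geq 0 \text{ for all } i
\end{aligned}
$$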
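The upper bound $\lambda_i \leq C$ comes from the slack variables. As a sketch of the missing step, write the margin term of the Lagrangian as $\lambda_i (1 - \xi_i - y^{(i)}((x^{(i)})^\top \theta + \theta_0))$, with multipliers $\mu_i \geq 0$ on the constraints $\xi_i \geq 0$, and set the derivatives to zero:

```latex
\begin{aligned}
\frac{\partial L}{\partial \theta}
  &= \theta - \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)} = 0 \\
\frac{\partial L}{\partial \theta_0}
  &= -\sum_{i=1}^n \lambda_i y^{(i)} = 0 \\
\frac{\partial L}{\partial \xi_i}
  &= C - \lambda_i - \mu_i = 0
  \;\Longrightarrow\; \lambda_i = C - \mu_i \leq C
\end{aligned}
```

With $C - \lambda_i - \mu_i = 0$, every term involving $\xi_i$ cancels from the Lagrangian, so the dual objective is the same as in the separable case and only the box constraint on $\lambda_i$ is new.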

13.3.2. Sequential Minimal Optimization and Coordinate Descent

Coordinate descent is a general way to optimize functions $f(x)$ of multiple
variables $x \in \mathbb{R}^d$. It executes as:

1. Choose a dimension $j \in \{1, 2, \ldots, d\}$.
2. Optimize $f(x_1, x_2, \ldots, x_j, \ldots, x_d)$ over $x_j$ while keeping the other
variables fixed.

These two steps are repeated until convergence.
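The two steps can be sketched for a quadratic function, where the one-dimensional minimization has a closed form. The code below assumes $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with a symmetric positive definite matrix `A`; the function name, the fixed number of sweeps, and the cyclic choice of coordinates are illustrative assumptions.

```python
import numpy as np

def coordinate_descent(A, b, num_sweeps=100):
    """Minimize f(x) = 0.5 * x^T A x - b^T x one coordinate at a time."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(num_sweeps):
        for j in range(len(b)):           # step 1: choose a dimension (cyclically)
            # step 2: solve df/dx_j = (A x)_j - b_j = 0 for x_j, others fixed
            rest = A[j] @ x - A[j, j] * x[j]
            x[j] = (b[j] - rest) / A[j, j]
    return x

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

x_cd = coordinate_descent(A, b)
x_exact = np.linalg.solve(A, b)   # the minimizer satisfies A x = b
print(x_cd, x_exact)
```

Each inner update moves along one axis only, which produces the staircase-shaped trajectory typical of coordinate descent on a 2D quadratic.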

[Figure: coordinate descent applied to a 2D quadratic function. The red line shows
the trajectory of coordinate descent; each "step" in the trajectory is an
iteration of the algorithm. Image from Wikipedia.]

We can apply a form of coordinate descent to solve the dual:


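$$
\begin{aligned}
\max_{\lambda} \quad & \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\
\text{subject to} \quad & \sum_{i=1}^n \lambda_i y^{(i)} = 0 \text{ and } C \geq \lambda_i \geq 0 \text{ for all } i
\end{aligned}
$$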

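A popular, efficient algorithm is Sequential Minimal Optimization (SMO). Because
of the equality constraint $\sum_i \lambda_i y^{(i)} = 0$, a single $\lambda_i$ cannot change on
its own, so SMO updates two variables at a time. It executes as:

Take a pair $\lambda_i, \lambda_j$, possibly using heuristics to guide the choice of $i, j$.
Reoptimize over $\lambda_i, \lambda_j$ while keeping the other variables fixed.
Repeat the above until convergence.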
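A minimal sketch of this loop is shown below. It is a simplified variant, not a production SMO: pairs are visited cyclically instead of being chosen by heuristics, the number of passes is fixed, and the pair update follows the standard clipped closed-form step. The function names, the toy data, and `C = 10` are assumptions made for illustration.

```python
import numpy as np

def smo_pair_step(lam, i, j, K, y, C):
    """Reoptimize the dual over (lam[i], lam[j]) in place.

    The update keeps sum_k lam_k * y_k unchanged and 0 <= lam <= C.
    """
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the pair direction
    if eta <= 1e-12:
        return
    g = (lam * y) @ K                          # g[k] = sum_m lam_m y_m K[m, k]
    E_i, E_j = g[i] - y[i], g[j] - y[j]        # the offset theta_0 cancels in E_i - E_j
    # Interval for lam[j] implied by the box and the equality constraint
    if y[i] != y[j]:
        lo, hi = max(0.0, lam[j] - lam[i]), min(C, C + lam[j] - lam[i])
    else:
        lo, hi = max(0.0, lam[i] + lam[j] - C), min(C, lam[i] + lam[j])
    new_j = float(np.clip(lam[j] + y[j] * (E_i - E_j) / eta, lo, hi))
    lam[i] += y[i] * y[j] * (lam[j] - new_j)   # compensate to keep the constraint
    lam[j] = new_j

def simple_smo(X, y, C=10.0, num_passes=20):
    n = len(y)
    K = X @ X.T                                # inner products between all points
    lam = np.zeros(n)
    for _ in range(num_passes):
        for i in range(n):
            for j in range(i + 1, n):
                smo_pair_step(lam, i, j, K, y, C)
    return lam

X = np.array([[2.0, 0.0], [0.0, 0.0], [3.0, 1.0], [-1.0, 2.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
lam = simple_smo(X, y)
print(lam)
```

On this toy dataset only the two points closest to the boundary end up with non-zero multipliers; practical implementations add working-set selection heuristics and a convergence test based on the optimality conditions.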

13.3.3. Obtaining a Primal Solution from the Dual

Next, assuming we can solve the dual, how do we find a separating hyperplane
$\theta, \theta_0$?

Recall that we already found an expression for the optimal $\theta^*$ (in the separable
case) as a function of $\lambda$:

$$
\theta^* = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}.
$$

Once we know $\theta^*$, it is easy to check that the solution for $\theta_0$ is given by

$$
\theta_0^* = -\frac{\max_{i : y^{(i)} = -1} (\theta^*)^\top x^{(i)} + \min_{i : y^{(i)} = +1} (\theta^*)^\top x^{(i)}}{2}.
$$
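Both formulas translate directly into NumPy. The sketch below recovers the hyperplane from a given dual solution; the function name `primal_from_dual`, the four-point dataset, and the multipliers (which are the optimal ones for this particular separable toy problem) are assumptions for illustration.

```python
import numpy as np

def primal_from_dual(lam, X, y):
    """Recover (theta, theta0) from dual variables in the separable case."""
    theta = (lam * y) @ X                      # theta = sum_i lam_i y_i x_i
    scores = X @ theta
    # Place the boundary halfway between the closest points of each class
    theta0 = -(scores[y == -1].max() + scores[y == 1].min()) / 2.0
    return theta, theta0

X = np.array([[2.0, 0.0], [0.0, 0.0], [3.0, 1.0], [-1.0, 2.0]])
y = np.array([1, -1, 1, -1])
lam = np.array([0.5, 0.5, 0.0, 0.0])

theta, theta0 = primal_from_dual(lam, X, y)
margins = y * (X @ theta + theta0)
print(theta, theta0, margins)
```

The two points with non-zero multipliers attain $y^{(i)}((x^{(i)})^\top \theta + \theta_0) = 1$ exactly, while the other two satisfy the constraint with room to spare.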

13.3.4. Support Vectors


A powerful property of the SVM dual is that at the optimum, most variables $\lambda_i$ are
zero! Thus, $\theta$ is a sum of a small number of points:

$$
\theta = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}.
$$

In the separable case, the points for which $\lambda_i > 0$ all lie exactly on the margin
(they are the points closest to the hyperplane).

These are called support vectors, and this is where the SVM algorithm takes its
name. We are going to illustrate the concept of an SVM using a figure.

13.3.5. A Hands-On Example


Let’s look at a concrete example of how to use the dual version of the SVM. In this
example, we are going to again use the Iris flower dataset. We will merge two of
the three classes to make it suitable for binary classification.

import numpy as np
import pandas as pd
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris(as_frame=True)
iris_X, iris_y = iris.data, iris.target

# subsample to a quarter of the data points (keep every fourth row)
iris_X = iris_X.loc[::4]
iris_y = iris_y.loc[::4]

# create a binary classification dataset with labels +/- 1:
# class 0 (Setosa) becomes -1; classes 1 and 2 are merged into +1
iris_y2 = iris_y.copy()
iris_y2[iris_y2==2] = 1
iris_y2[iris_y2==0] = -1

# print part of the dataset
pd.concat([iris_X, iris_y2], axis=1).head()

    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                 5.1               3.5                1.4               0.2      -1
4                 5.0               3.6                1.4               0.2      -1
8                 4.4               2.9                1.4               0.2      -1
12                4.8               3.0                1.4               0.1      -1
16                5.4               3.9                1.3               0.4      -1

Let’s visualize this dataset.

# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]
import warnings
warnings.filterwarnings("ignore")

# create 2d version of dataset (first two features: sepal length and sepal width)
X = iris_X.to_numpy()[:,:2]
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
                     np.arange(y_min, y_max, .02))

# Plot also the training points
p1 = plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=60, cmap=plt.cm.Paired)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend(handles=p1.legend_elements()[0], labels=['Setosa', 'Not Setosa'], loc='lower right')


We can run the dual version of the SVM by importing an implementation from
sklearn:

# https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html
from sklearn import svm

# fit the model; a large C allows almost no slack (close to a hard margin),
# which we use here for illustration purposes
clf = svm.SVC(kernel='linear', C=1000) # this optimizes the dual
# clf = svm.LinearSVC(dual=False) # this optimizes a primal formulation
clf.fit(X, iris_y2)

plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=30, cmap=plt.cm.Paired)

# evaluate the decision function on the grid
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# plot decision boundary and margins
plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
            linestyles=['--', '-', '--'])
# circle the support vectors
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,
            linewidth=1, facecolors='none', edgecolors='k')
plt.xlim([4.6, 6])
plt.ylim([2.25, 4])
plt.show()

We can see that the solid line defines the decision boundary, and the two dashed
lines mark the margin.

The data points that fall on the margin are the support vectors. Notice that only
these vectors determine the position of the hyperplane. If we “wiggle” any of the
other points (without moving them across the margin), the margin remains
unchanged, and therefore the max-margin hyperplane also remains unchanged.
However, moving the support vectors changes both the optimal margin and the
optimal hyperplane.

This observation provides an intuitive explanation for the formula

$$
\theta^* = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}.
$$

In this formula, $\lambda_i > 0$ only for the $x^{(i)}$ that are support vectors. Hence, only these
$x^{(i)}$ influence the position of the hyperplane, which matches our earlier intuition.
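This formula can be checked against a fitted scikit-learn model. The sketch below uses a small hypothetical dataset rather than the Iris data so that it is self-contained; it relies on the `dual_coef_`, `support_vectors_`, `support_`, and `coef_` attributes of `sklearn.svm.SVC`, where `dual_coef_` holds the products $\lambda_i y^{(i)}$ for the support vectors only.

```python
import numpy as np
from sklearn import svm

X = np.array([[2.0, 0.0], [0.0, 0.0], [3.0, 1.0], [-1.0, 2.0]])
y = np.array([1, -1, 1, -1])

clf = svm.SVC(kernel='linear', C=1000).fit(X, y)

# theta = sum over support vectors of (lambda_i * y_i) * x_i
theta_from_dual = clf.dual_coef_ @ clf.support_vectors_

print("support vector indices:", clf.support_)
print("theta from dual coefficients:", theta_from_dual)
print("theta reported by sklearn:  ", clf.coef_)
```

The weight vector rebuilt from the support vectors alone matches `coef_`, so the points that are not support vectors contribute nothing to the hyperplane.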

13.3.6. Algorithm: Support Vector Machine Classification (Dual Form)

In summary, the SVM algorithm can be succinctly defined by the following key
components.

Type: Supervised learning (binary classification)
Model family: Linear decision boundaries.
Objective function: Dual of SVM optimization problem.
Optimizer: Sequential minimal optimization.
Probabilistic interpretation: No simple interpretation!

In the next lecture, we will combine dual SVMs with a new idea called kernels,
which enable them to handle a very large number of features (and even an infinite
number of features) without having to compute those features explicitly.

By Cornell University
© Copyright 2023.
