Lecture 13: Dual Formulation of Support Vector Machines — Applied ML
Contents
Lecture 13: Dual Formulation of Support Vector Machines
13.1. Lagrange Duality
13.2. Dual Formulation of SVMs
13.3. Practical Considerations for SVM Duals
In this lecture, we will see a different formulation of the SVM called the dual. This
dual formulation will lead to new types of optimization algorithms with favorable
computational properties in scenarios when the number of features is very large
(and possibly even infinite!).
Before we define the dual of the SVM problem, we need to introduce some
additional concepts from optimization, namely Lagrange duality.
Large margins are good, since data should be far from the decision boundary.
Maximizing the margin of a linear model amounts to solving the following
optimization problem:
$$
\min_{\theta, \theta_0} \; \frac{1}{2} \|\theta\|^2
$$

$$
\text{subject to } \; y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \text{ for all } i.
$$
We are now going to look at a different way of optimizing this objective. But first,
we need to define Lagrange duality.
Consider a general constrained optimization problem of the form

$$
\min_{\theta \in \mathbb{R}^d} J(\theta)
\quad \text{such that } c_k(\theta) \leq 0 \text{ for } k = 1, 2, \ldots, K.
$$
Our goal is to find a small value of $J(\theta)$ such that the $c_k(\theta)$ are negative. Rather than solving the above problem, we can solve the following related optimization problem, which contains additional penalty terms:

$$
\min_{\theta} L(\theta, \lambda) = J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta),
$$

for some fixed penalty weights $\lambda_k \geq 0$.
This penalized form recovers the original constrained problem if we also maximize over the penalties $\lambda$. Define the primal objective

$$
P(\theta) = \max_{\lambda \geq 0} L(\theta, \lambda) = \max_{\lambda \geq 0} \left( J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right).
$$

The problem $\min_{\theta \in \mathbb{R}^d} P(\theta)$ and the original constrained problem have the same optimum $\theta^*$! To see why this is true, observe that

$$
P(\theta) = \begin{cases} J(\theta) & \text{if } c_k(\theta) \leq 0 \text{ for all } k, \\ +\infty & \text{otherwise}. \end{cases}
$$

If some constraint is violated ($c_k(\theta) > 0$), the inner maximization can drive $\lambda_k$ to infinity; if all constraints are satisfied, each term $\lambda_k c_k(\theta)$ is at most zero, so the maximum equals $J(\theta)$ (attained at $\lambda = 0$).
Thus, $\min_{\theta \in \mathbb{R}^d} P(\theta)$ is the solution to our original optimization problem.

Next, consider swapping the order of the min and the max. We define

$$
D(\lambda) = \min_{\theta \in \mathbb{R}^d} L(\theta, \lambda)
$$

and the problem $\max_{\lambda \geq 0} D(\lambda)$. We call this the Lagrange dual of the primal optimization problem $\min_{\theta \in \mathbb{R}^d} P(\theta)$. We can always construct a dual for the primal.
Moreover, in many cases, we have $\max_{\lambda \geq 0} D(\lambda) = \min_{\theta \in \mathbb{R}^d} P(\theta)$. Thus, the primal and the dual are equivalent! This fact is very important, and we will rely on it in the next steps of deriving the SVM dual.
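As a quick numerical sanity check, we can compare the two sides on a toy one-dimensional problem (this example is ours, not from the lecture): $J(\theta) = \theta^2$ with a single constraint $c(\theta) = 1 - \theta \leq 0$, i.e. $\theta \geq 1$.

```python
import numpy as np

thetas = np.linspace(-3.0, 3.0, 601)    # grid over θ
lambdas = np.linspace(0.0, 5.0, 501)    # grid over λ ≥ 0

J = thetas ** 2                         # J(θ) = θ²
c = 1.0 - thetas                        # c(θ) = 1 - θ, so c(θ) ≤ 0 means θ ≥ 1

# P(θ) = max_{λ≥0} L(θ, λ): equals J(θ) where the constraint holds, +∞ elsewhere
P = np.where(c <= 0, J, np.inf)

# D(λ) = min_θ L(θ, λ), with L(θ, λ) = J(θ) + λ c(θ), minimized over the θ grid
D = np.array([np.min(J + lam * c) for lam in lambdas])

print(P.min(), D.max())   # both ≈ 1.0: the primal and dual optima coincide here
```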
As an example, consider a supervised learning objective with a regularizer $R(\theta)$, written in constrained form:

$$
\min_{\theta \in \Theta} L(\theta) \quad \text{such that } R(\theta) \leq \gamma'.
$$

We will not prove this, but solving this problem is equivalent to solving the penalized problem $\min_{\theta \in \Theta} L(\theta) + \gamma R(\theta)$ for some $\gamma > 0$ that is different from $\gamma'$. In other words, we can regularize by explicitly enforcing $R(\theta)$ to be less than a value, or we can penalize $R(\theta)$.
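We can illustrate this equivalence numerically. The sketch below (our own toy example, with hypothetical data and a squared-norm regularizer) solves the penalized problem, reads off the constraint level $\gamma'$ implied by its solution, and checks that the constrained problem returns the same parameters.

```python
import numpy as np
from scipy.optimize import NonlinearConstraint, minimize

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))                       # hypothetical design matrix
b = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

def loss(theta):                                   # L(θ): mean squared error
    return np.mean((A @ theta - b) ** 2)

gamma = 1.0                                        # penalty weight γ
theta_pen = minimize(lambda t: loss(t) + gamma * (t @ t), np.zeros(3)).x
gamma_prime = theta_pen @ theta_pen                # constraint level γ' implied by θ_pen

# constrained form: min L(θ) subject to R(θ) = ||θ||² ≤ γ'
constraint = NonlinearConstraint(lambda t: t @ t, -np.inf, gamma_prime)
theta_con = minimize(loss, np.zeros(3), constraints=[constraint]).x

print(np.allclose(theta_pen, theta_con, atol=1e-3))   # the two solutions agree
```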
Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$: regression, in which the targets are continuous, and classification, in which the targets are discrete.
In this lecture, we assume $\mathcal{Y} = \{-1, +1\}$. We will also work with linear models of the form

$$
f_\theta(x) = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + \ldots + \theta_d \cdot x_d
$$

where $x \in \mathbb{R}^d$ is a vector of features and $y \in \{-1, 1\}$ is the target. The $\theta_j$ are the parameters of the model. In vectorized form, we can write

$$
f_\theta(x) = \theta^\top x + \theta_0.
$$
We define the geometric margin $\gamma^{(i)}$ with respect to a training example $(x^{(i)}, y^{(i)})$ as

$$
\gamma^{(i)} = y^{(i)} \cdot \frac{\theta^\top x^{(i)} + \theta_0}{\|\theta\|}.
$$

This also corresponds to the distance from $x^{(i)}$ to the hyperplane.
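In code, the geometric margins of a few points are straightforward to compute. The parameters and points below are made up for illustration.

```python
import numpy as np

theta, theta0 = np.array([1.0, -2.0]), 0.5               # hypothetical parameters
points = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]])  # hypothetical inputs
labels = np.array([1, -1, 1])

# γ^(i) = y^(i) (θᵀ x^(i) + θ0) / ||θ||: signed distance to the hyperplane
margins = labels * (points @ theta + theta0) / np.linalg.norm(theta)
print(margins)   # a positive margin means the point is on the correct side
```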
We saw that maximizing the margin of a linear model amounts to solving the following optimization problem:

$$
\min_{\theta, \theta_0} \; \frac{1}{2} \|\theta\|^2
$$

$$
\text{subject to } \; y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \text{ for all } i.
$$

The Lagrangian of this problem is

$$
L(\theta, \theta_0, \lambda) = \frac{1}{2} \|\theta\|^2 + \sum_{i=1}^n \lambda_i \left( 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \right).
$$
We have put each constraint inside the objective function and added a penalty $\lambda_i$ to it.
Recall that the Lagrange dual of the primal optimization problem $\min_{\theta \in \mathbb{R}^d} P(\theta)$ is given by

$$
\max_{\lambda \geq 0} D(\lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} L(\theta, \lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} \left( J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right).
$$

We can always construct a dual for the primal.
It is easy to write out the dual form of the max-margin problem. Consider minimizing the above Lagrangian over $\theta, \theta_0$ for any fixed value of $\lambda$. Setting the derivatives with respect to $\theta$ and $\theta_0$ to zero gives:
$$
\theta = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}
$$

$$
0 = \sum_{i=1}^n \lambda_i y^{(i)}
$$
Substituting these expressions into the Lagrangian, we obtain the following expression for the dual $\max_{\lambda \geq 0} D(\lambda) = \max_{\lambda \geq 0} \min_{\theta, \theta_0} L(\theta, \theta_0, \lambda)$:

$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$

$$
\text{subject to } \; \sum_{i=1}^n \lambda_i y^{(i)} = 0 \; \text{ and } \; \lambda_i \geq 0 \text{ for all } i.
$$
Recall that, in general, the dual provides a lower bound on the primal (this is known as weak duality):

$$
\max_{\lambda \geq 0} D(\lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} L(\theta, \lambda) \leq \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} L(\theta, \lambda) = \min_{\theta \in \mathbb{R}^d} P(\theta).
$$
For the max-margin problem, however, strong duality holds: the dual optimum equals the primal optimum. Hence, we can find the max-margin hyperplane by solving the dual problem

$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$

$$
\text{subject to } \; \sum_{i=1}^n \lambda_i y^{(i)} = 0 \; \text{ and } \; \lambda_i \geq 0 \text{ for all } i.
$$

Notice that the dual objective depends on the inputs $x^{(i)}$ only through the inner products $(x^{(i)})^\top x^{(k)}$.
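To make this concrete, here is a minimal sketch that solves the dual as a quadratic program on a small, linearly separable toy dataset and recovers $\theta^*$ and $\theta_0$ from the optimal $\lambda$. It assumes the cvxpy package is available; the dataset and variable names are our own.

```python
import numpy as np
import cvxpy as cp   # assumes cvxpy is installed

# a small linearly separable toy dataset
X_toy = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
                  [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y_toy = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

n = len(y_toy)
G = y_toy[:, None] * X_toy                 # rows are y^(i) x^(i)
Q = G @ G.T + 1e-9 * np.eye(n)             # Q[i,k] = y^(i) y^(k) (x^(i))ᵀ x^(k), plus jitter

lam = cp.Variable(n)
objective = cp.Maximize(cp.sum(lam) - 0.5 * cp.quad_form(lam, Q))
constraints = [lam >= 0, y_toy @ lam == 0]
cp.Problem(objective, constraints).solve()

lam_star = lam.value
theta_star = G.T @ lam_star                # θ* = Σ_i λ_i y^(i) x^(i)
support = lam_star > 1e-5                  # support vectors have λ_i > 0
# on the margin, y^(i)((x^(i))ᵀ θ* + θ0) = 1, hence θ0 = y^(i) - (x^(i))ᵀ θ*
theta0_star = np.mean(y_toy[support] - X_toy[support] @ theta_star)
print(theta_star, theta0_star)
```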
In the next lecture, we will see how we can use this property to solve machine
learning problems with a very large number of features (even possibly infinite!).
In this part, we will continue our discussion of the dual formulation of the SVM with
additional practical details.
Recall that the max-margin hyperplane can be formulated as the solution to the following primal optimization problem:

$$
\min_{\theta, \theta_0} \; \frac{1}{2} \|\theta\|^2
$$

$$
\text{subject to } \; y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \text{ for all } i.
$$
The solution to this problem also happens to be given by the following dual
problem:
$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$

$$
\text{subject to } \; \sum_{i=1}^n \lambda_i y^{(i)} = 0 \; \text{ and } \; \lambda_i \geq 0 \text{ for all } i.
$$
Recall also the soft-margin relaxation of this problem, in which each constraint may be violated by a slack $\xi_i \geq 0$ at a cost controlled by $C$:

$$
\min_{\theta, \theta_0, \xi} \; \frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \xi_i
$$

$$
\text{subject to } \; y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 - \xi_i \; \text{ and } \; \xi_i \geq 0 \text{ for all } i.
$$
This is the primal problem. Let's now form its dual. First, the Lagrangian $L(\lambda, \mu, \theta, \theta_0, \xi)$ equals
$$
\frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \lambda_i \left( y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) - 1 + \xi_i \right) - \sum_{i=1}^n \mu_i \xi_i.
$$
As earlier, we can solve for the optimal $\theta, \theta_0$ in closed form and plug the resulting values back into the objective. We can then show that the dual takes the following form:
$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$

$$
\text{subject to } \; \sum_{i=1}^n \lambda_i y^{(i)} = 0 \; \text{ and } \; C \geq \lambda_i \geq 0 \text{ for all } i.
$$
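Compared to the separable case, the only change is the upper bound $\lambda_i \leq C$ on the dual variables. In the cvxpy sketch above, this amounts to one extra constraint (the value of `C` below is hypothetical):

```python
C = 1.0                                              # hypothetical regularization strength
soft_constraints = [lam >= 0, lam <= C, y_toy @ lam == 0]
cp.Problem(objective, soft_constraints).solve()      # re-solve with the box constraint 0 ≤ λ_i ≤ C
```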
(Figure: the red line shows the trajectory of coordinate descent. Each "step" in the trajectory is an iteration of the algorithm. Image from Wikipedia.)
We can apply a form of coordinate descent to solve the dual:
$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$

$$
\text{subject to } \; \sum_{i=1}^n \lambda_i y^{(i)} = 0 \; \text{ and } \; C \geq \lambda_i \geq 0 \text{ for all } i.
$$
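As an illustration, here is a minimal sketch of such a coordinate-style update in the spirit of the simplified SMO algorithm: it repeatedly picks a pair of dual variables $(\lambda_i, \lambda_j)$, optimizes them jointly while keeping $\sum_i \lambda_i y^{(i)} = 0$, and clips them to the box $[0, C]$. The function, its defaults, and the stopping rule are our own simplifications, not a reference implementation.

```python
import numpy as np

def smo_linear(X, y, C=1.0, tol=1e-4, max_passes=20, seed=0):
    """Pairwise coordinate ascent on the soft-margin SVM dual (simplified SMO sketch)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = X @ X.T                                     # Gram matrix of inner products
    lam, b = np.zeros(n), 0.0
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = (lam * y) @ K[:, i] + b - y[i]    # prediction error on example i
            # only update i if it violates the KKT conditions
            if (y[i] * E_i < -tol and lam[i] < C) or (y[i] * E_i > tol and lam[i] > 0):
                j = rng.integers(n - 1)
                j = j + 1 if j >= i else j          # random j != i
                E_j = (lam * y) @ K[:, j] + b - y[j]
                li, lj = lam[i], lam[j]
                # box [L, H] that keeps λ_i y_i + λ_j y_j constant
                if y[i] != y[j]:
                    L, H = max(0, lj - li), min(C, C + lj - li)
                else:
                    L, H = max(0, li + lj - C), min(C, li + lj)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                lam[j] = np.clip(lj - y[j] * (E_i - E_j) / eta, L, H)
                if abs(lam[j] - lj) < 1e-7:
                    continue
                lam[i] = li + y[i] * y[j] * (lj - lam[j])
                # update the intercept b (this plays the role of θ0)
                b1 = b - E_i - y[i]*(lam[i]-li)*K[i, i] - y[j]*(lam[j]-lj)*K[i, j]
                b2 = b - E_j - y[i]*(lam[i]-li)*K[i, j] - y[j]*(lam[j]-lj)*K[j, j]
                b = b1 if 0 < lam[i] < C else (b2 if 0 < lam[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    theta = (lam * y) @ X                           # θ* = Σ_i λ_i y^(i) x^(i)
    return lam, theta, b
```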
Recall that we already found an expression for the optimal $\theta^*$ (in the separable case) as a function of $\lambda$:

$$
\theta^* = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}.
$$
The points for which $\lambda_i > 0$ are precisely the points that lie on the margin (they are the closest to the hyperplane). These are called support vectors, and this is where the SVM algorithm takes its name. We will illustrate the concept of support vectors in the example below.
Let’s look at a concrete example of how to use the dual version of the SVM. In this
example, we are going to again use the Iris flower dataset. We will merge two of
the three classes to make it suitable for binary classification.
```python
import numpy as np
import pandas as pd
from sklearn import datasets

# based on https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]
import warnings
warnings.filterwarnings("ignore")
```
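The notebook cell that loads and plots the data is not reproduced above; a minimal sketch consistent with the description (keeping two features and merging two of the three classes) might look as follows. The variable names and the choice of which classes to merge are our own.

```python
# load the Iris dataset, keep two features for plotting, and merge two classes into one
iris = datasets.load_iris()
X = iris.data[:, :2]                       # sepal length and sepal width
y = np.where(iris.target == 0, -1, 1)      # class 0 vs. the merged classes 1 and 2

plt.scatter(X[y == 1, 0], X[y == 1, 1], label='+1')
plt.scatter(X[y == -1, 0], X[y == -1, 1], label='-1')
plt.xlabel(iris.feature_names[0]); plt.ylabel(iris.feature_names[1])
plt.legend()
```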
We can run the dual version of the SVM by importing an implementation from
sklearn:
```python
# based on https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html
from sklearn import svm
```
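A sketch of fitting a linear SVM on the data from above and drawing the resulting boundary, margins, and support vectors, in the spirit of the scikit-learn example linked in the comment (the plotting details and the choice of `C` are ours):

```python
clf = svm.SVC(kernel='linear', C=1000)     # a large C approximates the hard margin
clf.fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
ax = plt.gca()
xx, yy = np.meshgrid(np.linspace(*ax.get_xlim(), 50),
                     np.linspace(*ax.get_ylim(), 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# solid line: decision boundary; dotted lines: the margin
ax.contour(xx, yy, Z, levels=[-1, 0, 1], linestyles=[':', '-', ':'], colors='k')
# circle the support vectors
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
           s=100, facecolors='none', edgecolors='k')
```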
We can see that the solid line defines the decision boundary, and the two dotted
lines are the geometric margin.
The data points that fall on the margin are the support vectors. Notice that only
these vectors determine the position of the hyperplane. If we “wiggle” any of the
other points, the margin remains unchanged—therefore the max-margin
hyperplane also remains unchanged. However, moving the support vectors
changes both the optimal margin and the optimal hyperplane.
Recall the expression for the optimal $\theta^*$ that we derived from the dual:

$$
\theta^* = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}.
$$

In this formula, $\lambda_i > 0$ only for the $x^{(i)}$ that are support vectors. Hence, only these $x^{(i)}$ influence the position of the hyperplane, which matches our earlier intuition.
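We can check this formula against the fitted scikit-learn model from above: `SVC` exposes the products $\lambda_i y^{(i)}$ for the support vectors as `dual_coef_` and, for a linear kernel, the primal weights as `coef_`.

```python
# Σ_i λ_i y^(i) x^(i), summed over the support vectors only
theta_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(theta_from_dual, clf.coef_))   # matches the primal weight vector
```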
In the next lecture, we will combine dual SVMs with a new idea called kernels,
which enable them to handle a very large number of features (and even an infinite
number of features) without any additional computational cost.
By Cornell University
© Copyright 2023.