Lecture 12: Support Vector Machines — Applied ML
Contents
Lecture 12: Support Vector Machines
12.1. Classification Margins
12.2. The Max-Margin Classifier
12.2.2. Algorithm: Linear Support Vector Machine Classification
12.3. Soft Margins and the Hinge Loss
12.4. Optimization for SVMs
In this lecture, we are going to cover support vector machines (SVMs), one of the most successful classification algorithms in machine learning.
As before, we work with binary classification and linear models of the form

$$f_\theta(x) = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + \ldots + \theta_d \cdot x_d,$$

where $x \in \mathbb{R}^d$ is a vector of features and $y \in \{-1, 1\}$ is the target. The $\theta_j$ are the parameters of the model. We can represent the model in a vectorized form as

$$f_\theta(x) = \theta^\top x + \theta_0.$$
In this lecture, we will again use the Iris flower dataset. We will transform this
problem into a binary classification task by merging the two non-Setosa flowers
into one class. We use $\mathcal{Y} = \{-1, 1\}$ as the label space.
```python
import numpy as np
import pandas as pd
from sklearn import datasets

# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]
import warnings
warnings.filterwarnings("ignore")
```
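The cell that loads the dataset is not shown in this excerpt. The sketch below is an assumption chosen to be consistent with the code that follows: `X` holds the first two Iris features, `iris_y2` holds labels in $\{-1, +1\}$ (Setosa vs. the merged non-Setosa classes), and `xx, yy` is the grid used later to draw decision boundaries.

```python
# Minimal data-loading sketch (assumed; the original cell is not shown).
iris = datasets.load_iris()
X = iris.data[:, :2]                          # assume: first two features (sepal length/width)
iris_y2 = np.where(iris.target == 0, -1, 1)   # assume: Setosa = -1, merged non-Setosa classes = +1

# grid over the feature space, used below to draw decision boundaries
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
                     np.arange(y_min, y_max, .02))

plt.scatter(X[:, 0], X[:, 1], c=iris_y2, cmap=plt.cm.Paired)
plt.xlabel('Sepal length'); plt.ylabel('Sepal width')
```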
Consider the following three classification algorithms from sklearn. Each of them
outputs a different classification boundary.
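The cell defining the `models` list is not shown in this excerpt, and the exact classifiers used in the lecture are not recoverable from it; the list below is only a plausible stand-in with three different sklearn classifiers.

```python
# Hypothetical stand-in for the `models` list used below (the original choices are not shown).
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import SVC

models = [
    LogisticRegression(),
    Perceptron(),
    SVC(kernel='linear', C=0.1),
]
```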
```python
def fit_and_create_boundary(model):
    # fit the model and evaluate it on the plotting grid xx, yy
    model.fit(X, iris_y2)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    return Z

plt.figure(figsize=(12,3))
for i, model in enumerate(models):
    plt.subplot(1, 3, i + 1)            # one panel per model
    Z = fit_and_create_boundary(model)
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

plt.show()
```
If the class probability is > 0.5, the model outputs class 1. The score is an
estimate of confidence; it also represents how far we are from the decision
boundary.
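As a concrete illustration (a sketch using the assumed data and stand-in logistic regression model from above, not code from the lecture): `predict_proba` gives the class probability that is thresholded at 0.5, while `decision_function` gives the raw score $\theta^\top x + \theta_0$, whose magnitude grows with the distance from the boundary.

```python
# Illustration (assumes the X, iris_y2 sketch above).
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression().fit(X, iris_y2)

x_query = X[:1]                           # a single example
print(logreg.predict_proba(x_query))      # class probabilities; predict class 1 when p > 0.5
print(logreg.decision_function(x_query))  # signed score theta^T x + theta_0
```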
Intuitively, we want to select boundaries with high margin. This means that we are
as confident as possible for every point and we are as far as possible from the
decision boundary.
Several of the separating boundaries in our previous example had low margin: they came too close to some of the data points.
Below, we plot a decision boundary between the two classes (solid line) that has a high margin. The two dashed lines lie at the margin.

Points that are on the margin are highlighted in black. A good decision boundary is as far away as possible from the points at the margin.
```python
# https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html
from sklearn import svm

# `clf` is not defined in this excerpt; we assume a linear SVM as in the sklearn example above
clf = svm.SVC(kernel='linear', C=1000).fit(X, iris_y2)

plt.figure(figsize=(5,5))
plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=30, cmap=plt.cm.Paired)

# decision function over the grid: the 0 level is the boundary, the ±1 levels are the margins
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
```
Thus higher margin means higher confidence at each input point. However, we
have a problem.
It doesn’t make sense for the same classification boundary to have different margins simply because we rescale the parameters $\theta, \theta_0$.
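To make this concrete, here is a short check (not in the original text): rescaling the parameters leaves the decision boundary unchanged but scales the unnormalized margin.

$$\theta^\top x + \theta_0 = 0 \;\iff\; (2\theta)^\top x + 2\theta_0 = 0,
\qquad \text{but} \qquad
y\left((2\theta)^\top x + 2\theta_0\right) = 2\, y\left(\theta^\top x + \theta_0\right).$$

Dividing the score by $\|\theta\|$ removes this ambiguity, since $\|2\theta\| = 2\|\theta\|$.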
We therefore define the geometric margin of a training example $(x^{(i)}, y^{(i)})$ as

$$\gamma^{(i)} = y^{(i)} \left( \frac{\theta^\top x^{(i)} + \theta_0}{\|\theta\|} \right).$$
We call it geometric because $\gamma^{(i)}$ equals the distance between $x^{(i)}$ and the hyperplane.
If $y^{(i)} = 1$, then the margin $\gamma^{(i)}$ is large when the model score $f(x^{(i)}) = \theta^\top x^{(i)} + \theta_0$ is positive and large.

The same holds when $y^{(i)} = -1$: the margin is large when the score is negative and large in absolute value. We again capture our intuition that increasing the margin means increasing the confidence at each input point.
To see why $\gamma^{(i)}$ equals this distance, suppose for simplicity that $y^{(i)} = 1$ and note the following facts.

1. The points $x$ that lie on the decision boundary are those for which $\theta^\top x + \theta_0 = 0$ (the score is precisely zero, halfway between $1$ and $-1$).

2. The vector $\frac{\theta}{\|\theta\|}$ is perpendicular to the hyperplane $\theta^\top x + \theta_0 = 0$ and has unit norm (a fact from calculus).

3. Let $x_0$ be the point on the boundary closest to $x^{(i)}$. Then, by definition of the margin, $x^{(i)} = x_0 + \gamma^{(i)} \frac{\theta}{\|\theta\|}$, or

$$x_0 = x^{(i)} - \gamma^{(i)} \frac{\theta}{\|\theta\|}.$$

4. Since $x_0$ lies on the hyperplane, its score is zero:

$$\theta^\top \left( x^{(i)} - \gamma^{(i)} \frac{\theta}{\|\theta\|} \right) + \theta_0 = 0.$$

5. Solving for $\gamma^{(i)}$ and using the fact that $\theta^\top \theta = \|\theta\|^2$, we obtain

$$\gamma^{(i)} = \frac{\theta^\top x^{(i)} + \theta_0}{\|\theta\|},$$

which is our geometric margin. The case of $y^{(i)} = -1$ can be proven in a similar way.
We can use our formula for γ to precisely plot the margins on our earlier plot.
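Here is a minimal sketch (not the lecture's plotting code) of how the geometric margins could be computed in numpy from the fitted classifier `clf` above; sklearn's `coef_` and `intercept_` hold $\theta$ and $\theta_0$, and the labels are assumed to be in $\{-1, +1\}$ as in the data sketch above.

```python
# Sketch (assumed): geometric margins of the training points under the fitted linear SVM `clf`.
theta_hat = clf.coef_.ravel()       # theta
theta0_hat = clf.intercept_[0]      # theta_0
gamma = iris_y2 * (X.dot(theta_hat) + theta0_hat) / np.linalg.norm(theta_hat)
print('smallest geometric margin:', gamma.min())
```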
We have seen a way to measure the confidence level of a classifier at a data point
using the notion of a margin. Next, we are going to see how to maximize the margin
of linear classifiers.
Concretely, we solve the following optimization problem:

$$\begin{align*}
\max_{\theta, \theta_0, \gamma} \;\; & \gamma \\
\text{subject to} \;\; & y^{(i)} \frac{(x^{(i)})^\top \theta + \theta_0}{\|\theta\|} \geq \gamma \;\; \text{for all } i.
\end{align*}$$
This maximizes the smallest margin over the training points $(x^{(i)}, y^{(i)})$. It guarantees that each point has margin at least $\gamma$.
This problem is difficult to optimize because of the division by $\|\theta\|$, and we would like to simplify it. First, consider the following equivalent problem:
$$\begin{align*}
\max_{\theta, \theta_0, \gamma} \;\; & \gamma \\
\text{subject to} \;\; & y^{(i)} \frac{(x^{(i)})^\top \theta + \theta_0}{\|\theta\|} \geq \gamma \;\; \text{for all } i \\
& \|\theta\| \cdot \gamma = 1.
\end{align*}$$

The extra constraint $\|\theta\| \cdot \gamma = 1$ ensures that we cannot rescale $\theta$; it also asks our linear model to assign each $x^{(i)}$ a score of at least $\pm 1$:

$$y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \quad \text{for all } i.$$
Since the constraint forces $\gamma = 1/\|\theta\|$, maximizing the margin is equivalent to solving

$$\begin{align*}
\max_{\theta, \theta_0} \;\; & \frac{1}{\|\theta\|} \\
\text{subject to} \;\; & y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \;\; \text{for all } i,
\end{align*}$$

which is in turn equivalent to

$$\begin{align*}
\min_{\theta, \theta_0} \;\; & \frac{1}{2} \|\theta\|^2 \\
\text{subject to} \;\; & y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \;\; \text{for all } i.
\end{align*}$$
The above optimization problem defines the linear support vector machine (SVM) classifier; it can be solved with standard quadratic programming software. We can succinctly define the algorithm components.
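For completeness, here is a sketch (not from the lecture) of how this quadratic program could be solved directly with the cvxpy library, using the assumed data `X, iris_y2` from above. The lecture itself takes the soft-margin and subgradient-descent route described next.

```python
# Sketch: the hard-margin SVM as a quadratic program, solved with cvxpy (illustration only).
import cvxpy as cp

d = X.shape[1]
theta = cp.Variable(d)
theta0 = cp.Variable()

# minimize (1/2)||theta||^2  subject to  y_i (theta^T x_i + theta_0) >= 1 for all i
constraints = [cp.multiply(iris_y2, X @ theta + theta0) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(theta)), constraints)
problem.solve()

print(theta.value, theta0.value)
```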
So far, we have assumed that a separating hyperplane exists. However, what if the classes are not linearly separable? Then our optimization problem has no solution, and we need to modify it.
Our solution is to make the constraints "soft" by introducing slack variables $\xi_i \geq 0$, one per data point, and penalizing their total size:

$$\begin{align*}
\min_{\theta, \theta_0, \xi} \;\; & \frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \xi_i \\
\text{subject to} \;\; & y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 - \xi_i \;\; \text{for all } i \\
& \xi_i \geq 0.
\end{align*}$$

Here $C > 0$ is a penalty parameter that trades off the size of the margin against the amount of constraint violation.
This yields the following expression for the slack variables at the optimum:

$$\xi_i = \max\left( 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right),\, 0 \right) := \left( 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \right)^+.$$

Since $\xi_i = \left( 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \right)^+$, we can take the objective

$$\min_{\theta, \theta_0, \xi} \; \frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \xi_i$$
and substitute for $\xi_i$ to obtain the unconstrained problem

$$\min_{\theta, \theta_0} \; \frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \left( 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \right)^+.$$

Equivalently, dividing by $C$ and setting $\lambda = 1/C$, we can write this as

$$\min_{\theta, \theta_0} \; \sum_{i=1}^n \underbrace{\left( 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \right)^+}_{\text{hinge loss}} + \underbrace{\frac{\lambda}{2} \|\theta\|^2}_{\text{regularizer}}.$$
The quantity $\max(1 - y \cdot f,\, 0)$, where $f = f_\theta(x)$ is the model's score, is called the hinge loss. If the prediction $f$ has the same sign as $y$ and $|f| \geq 1$, the loss is zero. In other words, if the predicted class is correct, no penalty is applied as long as the absolute value of the score $f$ is at least $1$.
```python
# Define the losses as a function of the score f, for a fixed true label y = 1.
# (These definitions are assumed; they are not shown in this excerpt.)
hinge_loss = lambda f: np.maximum(1 - f, 0)
l2_loss = lambda f: (1 - f)**2
l1_loss = lambda f: np.abs(1 - f)

# plot them
fs = np.linspace(0, 2)
plt.plot(fs, l1_loss(fs), fs, l2_loss(fs), fs, hinge_loss(fs), linewidth=9, alpha=0.5)
plt.legend(['L1 Loss', 'L2 Loss', 'Hinge Loss'])
plt.xlabel('Prediction f')
plt.ylabel('L(y=1,f)')
```
We have seen a new way to formulate the SVM objective. Let’s now see how to
optimize it.
12.4.0 Review
12.4.0.1. Review: SVM Objective
Maximizing the margin can be done by minimizing the following objective:
$$\min_{\theta, \theta_0} \; \sum_{i=1}^n \underbrace{\left( 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \right)^+}_{\text{hinge loss}} + \underbrace{\frac{\lambda}{2} \|\theta\|^2}_{\text{regularizer}}.$$
We can easily implement this objective in numpy. First we define the model.
```python
def f(X, theta):
    """The linear model we are trying to fit.

    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional data matrix

    Returns:
    y_pred (np.array): n-dimensional vector of predicted targets
    """
    return X.dot(theta)
```
```python
def svm_objective(theta, X, y, C=1.0):
    """The SVM objective J: mean hinge loss plus regularizer.
    (The function name and the default for C are assumed.)

    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional design matrix
    y (np.array): n-dimensional vector of targets
    C (float): regularization strength
    """
    return (np.maximum(1 - y * f(X, theta), 0) + C * 0.5 *
            np.linalg.norm(theta[:-1])**2).mean()
```
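Note that the objective regularizes `theta[:-1]`: it treats the last component of `theta` as the bias $\theta_0$, which assumes the design matrix has a column of ones appended. A minimal usage sketch (the name `Xb` is ours):

```python
# Append a column of ones so that theta[-1] plays the role of the bias theta_0 (assumed convention).
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

theta_init = np.zeros(Xb.shape[1])
print(svm_objective(theta_init, Xb, iris_y2))   # at theta = 0 every hinge term is 1, so J = 1.0
```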
Here, you see the linear part of $J$, which behaves like $1 - y \cdot f_\theta(x)$ (when $y \cdot f_\theta(x) < 1$), in orange:
When $y \cdot f_\theta(x) < 1$, we are in the “orange line” part and $J(\theta)$ behaves like $1 - y \cdot f_\theta(x)$. Its gradient is

$$\nabla_\theta J(\theta) = -y \cdot \nabla_\theta f_\theta(x) = -y \cdot x,$$

where we used $\nabla_\theta\, \theta^\top x = x$.

When $y \cdot f_\theta(x) \geq 1$, we are in the “flat” part and $J(\theta) = 0$. Hence the gradient is also just zero!

When $y \cdot f_\theta(x) = 1$, we are at the “kink”, and the gradient is not defined! In practice, we can take either the gradient when $y \cdot f_\theta(x) > 1$, or the gradient when $y \cdot f_\theta(x) < 1$, or anything in between. This is called the subgradient.

It equals:

$$\tilde{\nabla}_\theta J(\theta) = \begin{cases} -y \cdot x & \text{if } y \cdot f_\theta(x) < 1 \\ 0 & \text{otherwise.} \end{cases}$$
```python
def svm_subgradient(theta, X, y, C=1.0):
    """Returns a subgradient of the SVM objective.
    (The function name and the default for C are assumed.)

    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional design matrix
    y (np.array): n-dimensional vector of targets

    Returns:
    subgradient (np.array): d-dimensional subgradient
    """
    # zero out examples whose margin is at least 1: their hinge subgradient is 0
    yy = y.copy()
    yy[y*f(X,theta)>=1] = 0
    # average of -y^(i) x^(i) over the violating examples, plus the regularizer's gradient
    subgradient = np.mean(-yy * X.T, axis=1)
    subgradient[:-1] += C * theta[:-1]
    return subgradient
```
```python
# hyperparameters for subgradient descent
threshold = 5e-4
step_size = 1e-2
```
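The training loop itself is not shown in this excerpt. Below is a minimal sketch of subgradient descent using the hyperparameters above and the functions defined earlier; the random initialization and the stopping rule are assumptions, chosen to be consistent with the per-iteration objective values printed below.

```python
# Sketch of subgradient descent on the SVM objective (the original loop is not shown).
theta = np.random.randn(Xb.shape[1])   # random starting point (assumption)
theta_prev = theta + 1.0               # ensure the loop is entered

iteration = 0
while np.linalg.norm(theta - theta_prev) > threshold:
    if iteration % 1000 == 0:
        print('Iteration %d. J: %f' % (iteration, svm_objective(theta, Xb, iris_y2)))
    theta_prev = theta.copy()
    theta = theta - step_size * svm_subgradient(theta, Xb, iris_y2)
    iteration += 1
```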
```
Iteration 0. J: 3.728947
Iteration 1000. J: 0.376952
Iteration 2000. J: 0.359075
Iteration 3000. J: 0.351587
Iteration 4000. J: 0.344411
Iteration 5000. J: 0.337912
Iteration 6000. J: 0.331617
Iteration 7000. J: 0.326604
Iteration 8000. J: 0.322224
Iteration 9000. J: 0.319250
Iteration 10000. J: 0.316727
Iteration 11000. J: 0.314800
Iteration 12000. J: 0.313181
Iteration 13000. J: 0.311843
Iteration 14000. J: 0.310667
Iteration 15000. J: 0.309561
Iteration 16000. J: 0.308496
Iteration 17000. J: 0.307523
Iteration 18000. J: 0.306614
Iteration 19000. J: 0.305768
Iteration 20000. J: 0.305068
Iteration 21000. J: 0.304293
```
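The code that produced the final figure is not shown apart from its last line below; here is a minimal sketch (an assumption) of how the learned boundary could be drawn from the final `theta`.

```python
# Sketch (assumed): draw the learned decision boundary from the final parameters `theta`.
# The grid points get a column of ones appended, matching the bias convention used above.
grid = np.c_[xx.ravel(), yy.ravel(), np.ones(xx.ravel().shape[0])]
Z = np.sign(grid.dot(theta)).reshape(xx.shape)

plt.figure(figsize=(5,5))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=30, cmap=plt.cm.Paired)
```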
```python
plt.show()
```
By Cornell University
© Copyright 2023.