Lecture 12: Support Vector Machines

Contents
Lecture 12: Support Vector Machines
12.1. Classification Margins
12.2. The Max-Margin Classifier
12.2.2. Algorithm: Linear Support Vector Machine Classification
12.3. Soft Margins and the Hinge Loss
12.4. Optimization for SVMs

In this lecture, we are going to cover support vector machines (SVMs), one of the
most successful classification algorithms in machine learning.

We start the presentation of SVMs by defining the classification margin.

12.1.1. Review and Motivation


12.1.1.1. Review of Binary Classification
Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.
Recall that we distinguish between two types of supervised learning problems
depending on the targets $y^{(i)}$.

1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.


2. Binary Classification: The target variable $y$ is discrete and takes on one of
$K = 2$ possible values.

In this lecture, we focus on binary classification and assume $\mathcal{Y} = \{-1, +1\}$.

Linear Model Family


In this lecture, we will work with linear models of the form:

$$f_\theta(x) = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + \ldots + \theta_d \cdot x_d$$

where $x \in \mathbb{R}^d$ is a vector of features and $y \in \{-1, 1\}$ is the target. The $\theta_j$ are the
parameters of the model. We can represent the model in a vectorized form as

$$f_\theta(x) = \theta^\top x + \theta_0.$$
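As a minimal sketch (with made-up numbers, not values from the Iris dataset used below), evaluating this model is just an inner product plus an intercept; a positive score will be read as class $+1$ and a negative score as class $-1$:

import numpy as np

# illustrative parameters and features (not fitted values)
theta = np.array([0.5, -1.2, 0.3])  # theta_1, ..., theta_d
theta0 = 0.1                        # intercept theta_0
x = np.array([1.0, 2.0, 3.0])

# f_theta(x) = theta^T x + theta_0
score = theta.dot(x) + theta0
predicted_class = 1 if score > 0 else -1
print(score, predicted_class)       # -0.9, -1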

12.1.1.2. Binary Classification Problem and The Iris Dataset


In this lecture, we will again use the Iris flower dataset. We will transform this
problem into a binary classification task by merging the two non-Setosa flowers
into one class. We use $\mathcal{Y} = \{-1, 1\}$ as the label space.

The resulting dataset is partly shown below.

import numpy as np
import pandas as pd
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris(as_frame=True)
iris_X, iris_y = iris.data, iris.target

# subsample to every fourth data point
iris_X = iris_X.loc[::4]
iris_y = iris_y.loc[::4]

# create a binary classification dataset with labels +/- 1
iris_y2 = iris_y.copy()
iris_y2[iris_y2==2] = 1
iris_y2[iris_y2==0] = -1

# print part of the dataset
pd.concat([iris_X, iris_y2], axis=1).head()

    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                 5.1               3.5                1.4               0.2      -1
4                 5.0               3.6                1.4               0.2      -1
8                 4.4               2.9                1.4               0.2      -1
12                4.8               3.0                1.4               0.1      -1
16                5.4               3.9                1.3               0.4      -1

As in earlier lectures, we visualize this dataset using matplotlib.

# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]
import warnings
warnings.filterwarnings("ignore")

# create a 2d version of the dataset using the first two features
X = iris_X.to_numpy()[:,:2]
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))

# Plot the training points
p1 = plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=60, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend(handles=p1.legend_elements()[0], labels=['Setosa', 'Not Setosa'], loc='lower right')

<matplotlib.legend.Legend at 0x12b01cd30>


12.1.2. Comparing Classification Algorithms


We have seen different types of approaches to classification. When fitting a model,
there may be many valid decision boundaries. How do we select one of them?

Consider the following three classification algorithms from sklearn. Each of them
outputs a different classification boundary.

from sklearn.linear_model import LogisticRegression, Perceptron, RidgeClassifier

models = [LogisticRegression(), Perceptron(), RidgeClassifier()]

def fit_and_create_boundary(model):
    model.fit(X, iris_y2)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    return Z

plt.figure(figsize=(12,3))
for i, model in enumerate(models):
    plt.subplot(1, 3, i + 1)
    Z = fit_and_create_boundary(model)
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=iris_y2, edgecolors='k', cmap=plt.cm.Paired)
    plt.title('Algorithm %d' % (i+1))
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')

plt.show()

12.1.2.1. Classification Scores


Most classification algorithms output not just a class label but a score. For
example, logistic regression returns the class probability

$$p(y = 1 \mid x) = \sigma(\theta^\top x) \in [0, 1].$$

If the class probability is $> 0.5$, the model outputs class $1$. The score is an
estimate of confidence; it also represents how far we are from the decision
boundary.
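As a small illustration (an aside, using the X and iris_y2 variables defined above), sklearn's LogisticRegression exposes both the class probabilities and the raw scores $\theta^\top x + \theta_0$:

from sklearn.linear_model import LogisticRegression

# fit logistic regression on the 2d Iris data from above
logreg = LogisticRegression()
logreg.fit(X, iris_y2)

# class probabilities sigma(theta^T x + theta_0) for the first few points
print(logreg.predict_proba(X[:3]))

# signed scores theta^T x + theta_0; their magnitude reflects distance from the boundary
print(logreg.decision_function(X[:3]))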

12.1.2.2. The Max-Margin Principle


Intuitively, we want to select boundaries with high margin. This means that we are
as confident as possible for every point and we are as far as possible from the
decision boundary.

Several of the separating boundaries in our previous example had low margin: they
came too close to the training points.

from sklearn.linear_model import Perceptron, RidgeClassifier
from sklearn.svm import SVC

models = [SVC(kernel='linear', C=10000), Perceptron(), RidgeClassifier()]

def fit_and_create_boundary(model):
    model.fit(X, iris_y2)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    return Z

plt.figure(figsize=(12,3))
for i, model in enumerate(models):
    plt.subplot(1, 3, i + 1)
    Z = fit_and_create_boundary(model)
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=iris_y2, edgecolors='k', cmap=plt.cm.Paired)
    if i == 0:
        plt.title('Good Margin')
    else:
        plt.title('Bad Margin')
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')

plt.show()

Below, we plot a decision boundary between the two classes (solid line) that has a
high margin. The two dashed lines lie at the margin.

Points that are at the margin are highlighted in black. A good decision boundary is as
far away as possible from the points at the margin.


# https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html
from sklearn import svm

# fit the model; don't regularize, for illustration purposes
clf = svm.SVC(kernel='linear', C=1000)  # we'll explain this algorithm shortly
clf.fit(X, iris_y2)

plt.figure(figsize=(5,5))
plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=30, cmap=plt.cm.Paired)
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# plot decision boundary and margins
plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,
            linewidth=1, facecolors='none', edgecolors='k')
plt.xlim([4.6, 6])
plt.ylim([2.25, 4])

(2.25, 4.0)

12.1.3. The Functional Classification Margin


How can we define the concept of margin more formally?

We can try to define the margin $\tilde\gamma^{(i)}$ with respect to a training example $(x^{(i)}, y^{(i)})$ as

$$\tilde\gamma^{(i)} = y^{(i)} \cdot f(x^{(i)}) = y^{(i)} \cdot (\theta^\top x^{(i)} + \theta_0).$$

We call this the functional margin. Let’s analyze it.

We defined the functional margin as

$$\tilde\gamma^{(i)} = y^{(i)} \cdot (\theta^\top x^{(i)} + \theta_0).$$

If $y^{(i)} = 1$, then the margin $\tilde\gamma^{(i)}$ is large when the model score
$f(x^{(i)}) = \theta^\top x^{(i)} + \theta_0$ is positive and large.

Thus, we are classifying $x^{(i)}$ correctly and with high confidence.

If $y^{(i)} = -1$, then the margin $\tilde\gamma^{(i)}$ is large when the model score
$f(x^{(i)}) = \theta^\top x^{(i)} + \theta_0$ is negative and large in absolute value.

We are again classifying $x^{(i)}$ correctly and with high confidence.

Thus, higher margin means higher confidence at each input point. However, we
have a problem.

If we rescale the parameters $\theta, \theta_0$ by a scalar $\alpha > 0$, we get new parameters $\alpha\theta, \alpha\theta_0$.

The parameters $\alpha\theta, \alpha\theta_0$ don't change the classification of points.

However, the functional margin $y^{(i)}(\alpha\theta^\top x^{(i)} + \alpha\theta_0) = \alpha \cdot y^{(i)}(\theta^\top x^{(i)} + \theta_0)$ is now scaled by $\alpha$!

It doesn't make sense that the same classification boundary can have different
margins when we rescale it.
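A quick numeric check of this issue (a sketch with made-up parameter values, not quantities from the lecture):

# rescaling theta, theta_0 by alpha rescales the functional margin by alpha
theta_demo = np.array([1.0, -2.0])   # illustrative parameters, not fitted values
theta0_demo = 0.5
x_demo, y_demo = np.array([5.0, 1.0]), 1

functional_margin = y_demo * (theta_demo.dot(x_demo) + theta0_demo)
alpha = 10.0
functional_margin_scaled = y_demo * (alpha * theta_demo.dot(x_demo) + alpha * theta0_demo)
print(functional_margin, functional_margin_scaled)  # the second is alpha times the first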

12.1.4. The Geometric Classification Margin


We define the geometric margin $\gamma^{(i)}$ with respect to a training example $(x^{(i)}, y^{(i)})$
as

$$\gamma^{(i)} = y^{(i)} \left( \frac{\theta^\top x^{(i)} + \theta_0}{\|\theta\|} \right).$$

We call it geometric because $\gamma^{(i)}$ equals the distance between $x^{(i)}$ and the
hyperplane.

We normalize the functional margin by $\|\theta\|$.

Rescaling the weights does not make the margin arbitrarily large.

Let’s make sure our intuition about the margin holds.

$$\gamma^{(i)} = y^{(i)} \left( \frac{\theta^\top x^{(i)} + \theta_0}{\|\theta\|} \right).$$

If $y^{(i)} = 1$, then the margin $\gamma^{(i)}$ is large when the model score
$f(x^{(i)}) = \theta^\top x^{(i)} + \theta_0$ is positive and large.

Thus, we are classifying $x^{(i)}$ correctly and with high confidence.

The same holds when $y^{(i)} = -1$. We again capture our intuition that
increasing margin means increasing the confidence of each input point.
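Continuing the small numeric sketch from above, dividing by $\|\theta\|$ makes the margin invariant to rescaling:

# the geometric margin divides by the parameter norm, so the rescaling cancels out
geometric_margin = y_demo * (theta_demo.dot(x_demo) + theta0_demo) / np.linalg.norm(theta_demo)
geometric_margin_scaled = y_demo * (alpha * theta_demo.dot(x_demo) + alpha * theta0_demo) / np.linalg.norm(alpha * theta_demo)
print(geometric_margin, geometric_margin_scaled)  # identical (up to floating point)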

12.1.4.1. Geometric Intuitions


The margin $\gamma^{(i)}$ is called geometric because it corresponds to the distance from
$x^{(i)}$ to the separating hyperplane $\theta^\top x + \theta_0 = 0$ (dashed line below).


Suppose that $y^{(i)} = 1$ ($x^{(i)}$ lies on the positive side of the boundary). Then:

1. The points $x$ that lie on the decision boundary are those for which
$\theta^\top x + \theta_0 = 0$ (the score is precisely zero, and between -1 and 1).

2. The vector $\frac{\theta}{\|\theta\|}$ is perpendicular to the hyperplane $\theta^\top x + \theta_0 = 0$ and has unit
norm (a fact from calculus).

3. Let $x_0$ be the point on the boundary closest to $x^{(i)}$. Then, by definition of the
margin, $x^{(i)} = x_0 + \gamma^{(i)} \frac{\theta}{\|\theta\|}$, or

$$x_0 = x^{(i)} - \gamma^{(i)} \frac{\theta}{\|\theta\|}.$$

4. Since $x_0$ is on the hyperplane, $\theta^\top x_0 + \theta_0 = 0$, or

$$\theta^\top \left( x^{(i)} - \gamma^{(i)} \frac{\theta}{\|\theta\|} \right) + \theta_0 = 0.$$

5. Solving for $\gamma^{(i)}$ and using the fact that $\theta^\top \theta = \|\theta\|^2$, we obtain

$$\gamma^{(i)} = \frac{\theta^\top x^{(i)} + \theta_0}{\|\theta\|}.$$

This is our geometric margin. The case of $y^{(i)} = -1$ can be proven in a
similar way.

We can use our formula for γ to precisely plot the margins on our earlier plot.


# plot decision boundary and margins
plt.figure(figsize=(5,5))
plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=30, cmap=plt.cm.Paired)
plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,
            linewidth=1, facecolors='none', edgecolors='k')
plt.xlim([4.6, 6.1])
plt.ylim([2.25, 4])

# plot margin vectors
theta = clf.coef_[0]
theta0 = clf.intercept_
for idx in clf.support_[:3]:
    x0 = X[idx]
    y0 = iris_y2.iloc[idx]
    margin_x0 = (theta.dot(x0) + theta0)[0] / np.linalg.norm(theta)
    w = theta / np.linalg.norm(theta)
    plt.plot([x0[0], x0[0]-w[0]*margin_x0], [x0[1], x0[1]-w[1]*margin_x0], color='blue')
    plt.scatter([x0[0]-w[0]*margin_x0], [x0[1]-w[1]*margin_x0], color='blue')
plt.show()

We have seen a way to measure the confidence level of a classifier at a data point
using the notion of a margin. Next, we are going to see how to maximize the margin
of linear classifiers.

12.2.1. Maximizing the Margin


We want to define an objective that will result in maximizing the margin. As a first
attempt, consider the following optimization problem.

$$\begin{aligned}
\max_{\theta, \theta_0, \gamma} \quad & \gamma \\
\text{subject to} \quad & y^{(i)} \frac{(x^{(i)})^\top \theta + \theta_0}{\|\theta\|} \geq \gamma \;\; \text{for all } i
\end{aligned}$$

This maximizes the smallest margin over the $(x^{(i)}, y^{(i)})$. It guarantees that each point
has margin at least $\gamma$.

This problem is difficult to optimize because of the division by $\|\theta\|$, and we would
like to simplify it. First, consider the equivalent problem:


$$\begin{aligned}
\max_{\theta, \theta_0, \gamma} \quad & \gamma \\
\text{subject to} \quad & y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq \gamma \|\theta\| \;\; \text{for all } i
\end{aligned}$$

Note that this problem has an extra degree of freedom:

Suppose we multiply $\theta, \theta_0$ by some constant $c > 0$.

This yields another valid solution!

To enforce uniqueness, we add another constraint that doesn't change the
minimizer:

$$\|\theta\| \cdot \gamma = 1.$$

This ensures we cannot rescale $\theta$ and also asks our linear model to assign each
$x^{(i)}$ a score of at least $\pm 1$:

$$y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \;\; \text{for all } i$$

If the constraint $\|\theta\| \cdot \gamma = 1$ holds, then we know that $\gamma = 1/\|\theta\|$, and we can
replace $\gamma$ in the optimization problem to obtain:

$$\begin{aligned}
\max_{\theta, \theta_0} \quad & \frac{1}{\|\theta\|} \\
\text{subject to} \quad & y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \;\; \text{for all } i
\end{aligned}$$

The solution of this problem is still the same.

Finally, instead of maximizing $1/\|\theta\|$, we can minimize $\|\theta\|$, or equivalently we can
minimize $\frac{1}{2}\|\theta\|^2$:

$$\begin{aligned}
\min_{\theta, \theta_0} \quad & \frac{1}{2} \|\theta\|^2 \\
\text{subject to} \quad & y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 \;\; \text{for all } i
\end{aligned}$$

This is now a quadratic program that can be solved using off-the-shelf
optimization algorithms!
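For instance, here is a minimal sketch of solving this quadratic program directly with the cvxpy library (cvxpy is not used elsewhere in this lecture, so treat this as an optional aside); it reuses the 2d features X and labels iris_y2 from above, which happen to be linearly separable here, so the hard-margin constraints are feasible:

import cvxpy as cp

d = X.shape[1]
y = iris_y2.to_numpy()

theta_var = cp.Variable(d)
theta0_var = cp.Variable()

# minimize (1/2)||theta||^2 subject to y^(i) (theta^T x^(i) + theta_0) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(theta_var))
constraints = [cp.multiply(y, X @ theta_var + theta0_var) >= 1]
cp.Problem(objective, constraints).solve()

print(theta_var.value, theta0_var.value)

Up to numerical tolerance, the resulting parameters should describe the same maximum-margin boundary that svm.SVC found above.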

12.2.2. Algorithm: Linear Support Vector Machine Classification

The above procedure describes the linear Support Vector Machine classifier.
We can succinctly define the algorithm components.

Type: Supervised learning (binary classification).


Model family: Linear decision boundaries.
Objective function: Max-margin optimization.
Optimizer: Quadratic optimization algorithms.
Probabilistic interpretation: No simple interpretation!

Later, we will see several other versions of this algorithm.

Let’s continue looking at how we can maximize the margin.

12.3.1. Non-Separable Problems



So far, we have assumed that a separating linear hyperplane exists. However, what if the
classes are non-separable? Then our optimization problem does not have a
solution, and we need to modify it.

Our solution is going to be to make each constraint “soft” by introducing “slack”
variables, which allow the constraint to be violated:

$$y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 - \xi_i.$$

If we can classify each point with a perfect score of $\geq 1$, then $\xi_i = 0$.

If we cannot assign a perfect score, we assign a score of $1 - \xi_i$.

We define the optimization problem such that the $\xi_i$ are chosen to be as small as
possible.

In the optimization problem, we assign a penalty $C$ to these slack variables to
obtain:

$$\begin{aligned}
\min_{\theta, \theta_0, \xi} \quad & \frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \xi_i \\
\text{subject to} \quad & y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right) \geq 1 - \xi_i \;\; \text{for all } i \\
& \xi_i \geq 0
\end{aligned}$$
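As a brief aside (not part of the original derivation), sklearn's SVC exposes this penalty through its C parameter: a small C tolerates more slack, while the very large C used earlier approximates the hard-margin classifier.

from sklearn.svm import SVC

# a small C allows more margin violations (softer margin); a large C penalizes them heavily
soft_clf = SVC(kernel='linear', C=0.1).fit(X, iris_y2)
hard_clf = SVC(kernel='linear', C=10000).fit(X, iris_y2)

# the softer-margin classifier typically ends up with more support vectors
print(len(soft_clf.support_), len(hard_clf.support_))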

12.3.2. Towards an Unconstrained Objective


Let’s further modify things. Moving terms around in the inequality, we get:

$$\begin{aligned}
\min_{\theta, \theta_0, \xi} \quad & \frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \xi_i \\
\text{subject to} \quad & \xi_i \geq 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right), \;\; \xi_i \geq 0 \;\; \text{for all } i
\end{aligned}$$

If $0 \geq 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right)$, we classified $x^{(i)}$ perfectly and $\xi_i = 0$.

If $0 < 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right)$, then $\xi_i = 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right)$.

Thus, $\xi_i = \max\left(1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right), 0\right)$.

We simplify notation a bit by writing $(x)^+ = \max(x, 0)$.

This yields:

$$\xi_i = \max\left(1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right), 0\right) := \left(1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right)\right)^+$$

Since $\xi_i = \left(1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right)\right)^+$, we can take

$$\begin{aligned}
\min_{\theta, \theta_0, \xi} \quad & \frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \xi_i \\
\text{subject to} \quad & \xi_i \geq 1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right), \;\; \xi_i \geq 0 \;\; \text{for all } i
\end{aligned}$$

and turn it into the following by plugging in the definition of $\xi_i$:


$$\min_{\theta, \theta_0} \; \frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^n \left(1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right)\right)^+$$

Since it doesn't matter which term we multiply by $C > 0$, this is equivalent to

$$\min_{\theta, \theta_0} \; \sum_{i=1}^n \left(1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right)\right)^+ + \frac{\lambda}{2} \|\theta\|^2$$

for some $\lambda > 0$.

We have now turned our optimization problem into an unconstrained form:

$$\min_{\theta, \theta_0} \; \underbrace{\sum_{i=1}^n \left(1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right)\right)^+}_{\text{hinge loss}} + \underbrace{\frac{\lambda}{2} \|\theta\|^2}_{\text{regularizer}}$$

The hinge loss penalizes incorrect predictions.

The L2 regularizer ensures the weights are small and well-behaved.

12.3.3. The Hinge Loss


Consider again our new loss term for a label $y$ and a prediction $f$:

$$L(y, f) = \max\left(1 - y \cdot f, 0\right).$$

Let’s examine the behavior of this loss for different values of $y, f$:

If the prediction $f$ has the same sign as $y$ and $|f| \geq 1$, the loss is zero. In other
words, if the class is correct, no penalty is applied as long as the absolute value of the
score $f$ is greater than 1.

However, if the prediction $f$ has the wrong sign, or $|f| \leq 1$, the loss is
$|y - f|$. Thus, we penalize incorrect predictions, as well as predictions that are too
close to the midpoint between the two class labels (which is at zero, since
the labels are $\pm 1$).

Let’s visualize a few losses $L(y = 1, f)$ as a function of $f$, including the hinge loss.

# define the losses for a target of y=1
hinge_loss = lambda f: np.maximum(1 - f, 0)
l2_loss = lambda f: (1-f)**2
l1_loss = lambda f: np.abs(f-1)

# plot them
fs = np.linspace(0, 2)
plt.plot(fs, l1_loss(fs), fs, l2_loss(fs), fs, hinge_loss(fs), linewidth=9, alpha=0.5)
plt.legend(['L1 Loss', 'L2 Loss', 'Hinge Loss'])
plt.xlabel('Prediction f')
plt.ylabel('L(y=1,f)')

Text(0, 0.5, 'L(y=1,f)')


We can make a few interesting observations:

The hinge loss is linear, like the L1 loss.

But it only penalizes errors that are on the “wrong” side:

We have an error of $|f - y|$ if the true class is 1 and $f < 1$.

We don’t penalize for predicting $f > 1$ if the true class is 1.

plt.plot(fs, hinge_loss(fs), linewidth=9, alpha=0.5)
plt.legend(['Hinge Loss'])

<matplotlib.legend.Legend at 0x12e750a58>

12.3.3.1. Properties of the Hinge Loss


The hinge loss is one of the best losses in machine learning. We summarize here
several important properties of the hinge loss.

It penalizes errors “that matter,” hence it is less sensitive to outliers.

Minimizing a regularized hinge loss optimizes for a high margin.

The loss is non-differentiable at one point, which may make it more challenging to
optimize.

We have seen a new way to formulate the SVM objective. Let’s now see how to
optimize it.

12.4.0. Review
12.4.0.1. Review: SVM Objective
Maximizing the margin can be done by optimizing the following objective:


$$\min_{\theta, \theta_0} \; \underbrace{\sum_{i=1}^n \left(1 - y^{(i)} \left( (x^{(i)})^\top \theta + \theta_0 \right)\right)^+}_{\text{hinge loss}} + \underbrace{\frac{\lambda}{2} \|\theta\|^2}_{\text{regularizer}}$$

The hinge loss penalizes incorrect predictions.

The L2 regularizer ensures the weights are small and well-behaved.

We can easily implement this objective in numpy. First we define the model.

def f(X, theta):
    """The linear model we are trying to fit.

    Parameters:
        theta (np.array): d-dimensional vector of parameters
        X (np.array): (n,d)-dimensional data matrix

    Returns:
        y_pred (np.array): n-dimensional vector of predicted targets
    """
    return X.dot(theta)

And then we define the objective.

def svm_objective(theta, X, y, C=.1):
    """The cost function, J, describing the goodness of fit.

    Parameters:
        theta (np.array): d-dimensional vector of parameters
        X (np.array): (n,d)-dimensional design matrix
        y (np.array): n-dimensional vector of targets
    """
    return (np.maximum(1 - y * f(X, theta), 0) + C * 0.5 * np.linalg.norm(theta[:-1])**2).mean()

12.4.0.2. Review: Gradient Descent


If we want to optimize $J(\theta)$, we start with an initial guess $\theta_0$ for the parameters
and repeat the following update:

$$\theta_i := \theta_{i-1} - \alpha \cdot \nabla_\theta J(\theta_{i-1}).$$

As code, this method may look as follows:

theta, theta_prev = random_initialization()
while norm(theta - theta_prev) > convergence_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * gradient(theta_prev)

12.4.1. A Gradient for the Hinge Loss?


What is the gradient for the hinge loss with a linear $f$?

$$J(\theta) = \max\left(1 - y \cdot f_\theta(x), 0\right) = \max\left(1 - y \cdot \theta^\top x, 0\right).$$

Below, you can see in orange the linear part of $J$ that behaves like $1 - y \cdot f_\theta(x)$ (when
$y \cdot f_\theta(x) < 1$):


plt.plot(fs, hinge_loss(fs), fs[:25], hinge_loss(fs[:25]), linewidth=9, alpha=0.5)
plt.legend(['Hinge Loss', 'Hinge Loss when $y \cdot f < 1$'])

<matplotlib.legend.Legend at 0x12ea6f940>

When $y \cdot f_\theta(x) < 1$, we are in the “orange line” part and $J(\theta)$ behaves like
$1 - y \cdot f_\theta(x)$.

Hence the gradient in this regime is:

$$\nabla_\theta J(\theta) = -y \cdot \nabla_\theta f_\theta(x) = -y \cdot x,$$

where we used $\nabla_\theta \theta^\top x = x$.

When $y \cdot f_\theta(x) > 1$, we are in the “flat” part and $J(\theta) = 0$. Hence the gradient is
also just zero!

What is the gradient for the hinge loss with a linear $f$?

$$J(\theta) = \max\left(1 - y \cdot f_\theta(x), 0\right) = \max\left(1 - y \cdot \theta^\top x, 0\right).$$

When $y \cdot f_\theta(x) = 1$, we are at the “kink,” and the gradient is not defined!

In practice, we can take either the gradient when $y \cdot f_\theta(x) > 1$, the
gradient when $y \cdot f_\theta(x) < 1$, or anything in between. This is called a
subgradient.

12.4.1.1. A Steepest Descent Direction for the Hinge Loss

We can define a “gradient”-like function $\tilde\nabla_\theta J(\theta)$ for the hinge loss

$$J(\theta) = \max\left(1 - y \cdot f_\theta(x), 0\right) = \max\left(1 - y \cdot \theta^\top x, 0\right).$$

It equals:

$$\tilde\nabla_\theta J(\theta) = \begin{cases} -y \cdot x & \text{if } y \cdot f_\theta(x) < 1 \\ 0 & \text{otherwise} \end{cases}$$

12.4.2. (Sub-)Gradient Descent for SVM


Putting this together, we obtain a gradient descent algorithm (technically, it's
called subgradient descent).

theta, theta_prev = random_initialization()
while abs(J(theta) - J(theta_prev)) > conv_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * approximate_gradient

Let’s implement this algorithm.

First we implement the approximate gradient.

def svm_gradient(theta, X, y, C=.1):
    """The (approximate) gradient of the cost function.

    Parameters:
        theta (np.array): d-dimensional vector of parameters
        X (np.array): (n,d)-dimensional design matrix
        y (np.array): n-dimensional vector of targets

    Returns:
        subgradient (np.array): d-dimensional subgradient
    """
    yy = y.copy()
    yy[y*f(X,theta)>=1] = 0
    subgradient = np.mean(-yy * X.T, axis=1)
    subgradient[:-1] += C * theta[:-1]
    return subgradient

And then we implement subgradient descent.

threshold = 5e-4
step_size = 1e-2

theta, theta_prev = np.ones((3,)), np.zeros((3,))
iter = 0
iris_X['one'] = 1
X_train = iris_X.iloc[:,[0,1,-1]].to_numpy()
y_train = iris_y2.to_numpy()

while np.linalg.norm(theta - theta_prev) > threshold:
    if iter % 1000 == 0:
        print('Iteration %d. J: %.6f' % (iter, svm_objective(theta, X_train, y_train)))
    theta_prev = theta
    gradient = svm_gradient(theta, X_train, y_train)
    theta = theta_prev - step_size * gradient
    iter += 1


Iteration 0. J: 3.728947
Iteration 1000. J: 0.376952
Iteration 2000. J: 0.359075
Iteration 3000. J: 0.351587
Iteration 4000. J: 0.344411
Iteration 5000. J: 0.337912
Iteration 6000. J: 0.331617
Iteration 7000. J: 0.326604
Iteration 8000. J: 0.322224
Iteration 9000. J: 0.319250
Iteration 10000. J: 0.316727
Iteration 11000. J: 0.314800
Iteration 12000. J: 0.313181
Iteration 13000. J: 0.311843
Iteration 14000. J: 0.310667
Iteration 15000. J: 0.309561
Iteration 16000. J: 0.308496
Iteration 17000. J: 0.307523
Iteration 18000. J: 0.306614
Iteration 19000. J: 0.305768
Iteration 20000. J: 0.305068
Iteration 21000. J: 0.304293

We can visualize the results to convince ourselves we found a good boundary.

xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
Z = f(np.c_[xx.ravel(), yy.ravel(), np.ones(xx.ravel().shape)], theta)
Z[Z<0] = 0
Z[Z>0] = 1

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.show()

12.4.3. Algorithm: Linear Support Vector Machine Classification

The above procedure describes gradient descent optimization for support vector
machines. The algorithm card below summarizes this algorithm and its
components.


Type: Supervised learning (binary classification)


Model family: Linear decision boundaries.
Objective function: L2-regularized hinge loss.
Optimizer: Subgradient descent.
Probabilistic interpretation: No simple interpretation!

By Cornell University
© Copyright 2023.
