
Support vector machines (SVMs)

Dr. Saifullah Khalid


[email protected]
Slides Credit: Mostly based on UofT intro to machine learning course
Sequence

• Support vector machine (SVM)
• Optimal separating hyperplanes
• Non-separable data
• Kernel Method
• Dual formulation of SVM
• Inner products and kernels
Separating Hyperplane?
Support Vector Machine (SVM)

[Figure: maximum-margin separating hyperplane, with the support vectors circled on the margin boundaries]

• SVMs maximize the margin (or "street") around the separating hyperplane
• The decision function is fully specified by a (usually very small) subset of the training samples, the support vectors
Support Vectors

[Figure] Three support vectors v1, v2, v3, shown as vectors rather than just the three circled points at their tails; d denotes half of the street ‘width’.
Optimal Separating Hyperplane

• Optimal Separating Hyperplane: a hyperplane that separates the two classes and maximizes the distance to the closest point from either class, i.e., maximizes the margin of the classifier
• Intuitively, ensuring that the classifier is not too close to any data point leads to better generalization on the test data.
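To make "distance to the closest point" concrete: the signed distance of a point x to the hyperplane wᵀx + b = 0 is (wᵀx + b)/‖w‖, and the margin is the smallest such distance over the training points. A minimal sketch, with a made-up hyperplane and data (not taken from the slides):

```python
# Small illustration (not from the slides): distance from points to a
# hyperplane w^T x + b = 0, and the margin of the resulting classifier.
import numpy as np

w = np.array([2.0, 1.0])                        # hypothetical weight vector
b = -1.0                                        # hypothetical bias
X = np.array([[1.0, 2.0],
              [0.0, 3.0],
              [-1.0, -1.0]])                    # made-up training points
t = np.array([+1, +1, -1])                      # labels in {-1, +1}

signed_dist = (X @ w + b) / np.linalg.norm(w)   # signed distance to the hyperplane
margin = np.min(t * signed_dist)                # distance to the closest point
print(signed_dist, margin)
```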
Geometry of Points and Planes
Maximizing Margin as an Optimization Problem

Algebraic max-margin objective:

  min_{w,b} ½‖w‖²   s.t.   tᵢ(wᵀxᵢ + b) ≥ 1   ∀i ∈ N

• This is a Quadratic Program: quadratic objective + linear inequality constraints.
• The important training examples are the ones with algebraic margin 1; these are called support vectors.
• Hence, this algorithm is called the (hard) Support Vector Machine (SVM).
• SVM-like algorithms are often called max-margin or large-margin.

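A minimal sketch of this quadratic program, using cvxpy as the solver on made-up separable data (both the library and the data are my additions, not part of the slides):

```python
# Hard-margin QP: minimize (1/2)||w||^2  s.t.  t_i (w^T x_i + b) >= 1 for all i.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 1.0],            # made-up, linearly separable data
              [-2.0, -2.0], [-3.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))     # quadratic objective
constraints = [cp.multiply(t, X @ w + b) >= 1]       # linear inequality constraints
cp.Problem(objective, constraints).solve()

margins = t * (X @ w.value + b.value)
print("w =", w.value, "b =", b.value)
print("support vectors:", X[np.isclose(margins, 1.0, atol=1e-4)])   # algebraic margin 1
```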

Non-Separable Data Points

• How can we apply the max-margin principle if the data are not linearly separable?
Maximizing Margin for Non-Separable Data Points

Main Idea:
• Allow some points to be within the margin or even be misclassified; we represent this with slack variables ξᵢ.
• But constrain or penalize the total amount of slack.
• Soft-margin SVM objective:

  min_{w,b,ξ} ½‖w‖² + γ Σᵢ ξᵢ   s.t.   tᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0,  ∀i ∈ N

• γ is a hyperparameter that trades off the margin with the amount of slack.
  ► For γ = 0, we'll get w = 0. (Why?)
  ► As γ → ∞ we get the hard-margin objective.
• Note: it is also possible to constrain Σᵢ ξᵢ instead of penalizing it.
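For intuition about this trade-off, a short sketch using scikit-learn (my choice of library, not part of the slides); SVC's C parameter plays the same role as γ here, weighting the total slack against the margin:

```python
# Illustrative sketch: small C tolerates a lot of slack (many margin violations,
# many support vectors); large C approaches the hard-margin solution.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),     # two overlapping blobs
               rng.normal(+1.0, 1.0, size=(50, 2))])
t = np.array([-1] * 50 + [+1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(f"C = {C:6}: {clf.n_support_.sum()} support vectors")
```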
From Margin Violation to Hinge Loss
Let's simplify the soft-margin constraints by eliminating ξᵢ.

Recall:  tᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ  ∀i ∈ N
         ξᵢ ≥ 0  ∀i ∈ N

• We would like to find the smallest slack variable ξᵢ that satisfies both ξᵢ ≥ 1 − tᵢ(wᵀxᵢ + b) and ξᵢ ≥ 0
• Case 1: 1 − tᵢ(wᵀxᵢ + b) ≤ 0. The smallest non-negative ξᵢ that satisfies the constraint is ξᵢ = 0
• Case 2: 1 − tᵢ(wᵀxᵢ + b) > 0. The smallest ξᵢ that satisfies the constraint is ξᵢ = 1 − tᵢ(wᵀxᵢ + b)
• Hence, ξᵢ = max{0, 1 − tᵢ(wᵀxᵢ + b)}
• Therefore, the slack penalty can be written as Σᵢ₌₁ᴺ ξᵢ = Σᵢ₌₁ᴺ max{0, 1 − tᵢ(wᵀxᵢ + b)}, i.e., the hinge loss summed over the training examples
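A quick numeric check of this identity; the weights and data below are placeholders I made up, not taken from the slides:

```python
# The optimal slack equals the hinge loss max(0, 1 - t_i (w^T x_i + b)).
import numpy as np

w, b = np.array([1.0, -2.0]), 0.5               # illustrative parameters
X = np.array([[1.0, 0.0], [0.2, 0.4], [-1.0, 1.0]])
t = np.array([+1, +1, -1])

xi = np.maximum(0.0, 1.0 - t * (X @ w + b))     # per-example slack / hinge loss
print("slacks:", xi, "total penalty:", xi.sum())
```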
Kernel Methods, or the Kernel Trick
Nonlinear Decision Boundaries

• SV classifier: a margin-maximizing linear classifier
• Linear models are restrictive
• Q: How can we get nonlinear decision boundaries?
• Feature mapping x → φ(x)
• Q: How do we find good features?
Feature Maps

• For a quadratic decision boundary, what feature mapping do we need?
• One possibility (ignore the √2 for now): for d = 2, φ(x) = (x₁², √2 x₁x₂, x₂²), and similarly for larger d
• We have dim φ(x) = O(d²); in high dimensions, the computation cost might be large
• Can we avoid the high computation cost? Let us take a closer look at the SVM (see the numeric check below)

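Previewing where the next slides go: for the quadratic feature map above, the feature-space inner product φ(x)ᵀφ(z) can be computed directly as (xᵀz)² without ever forming φ(x). A small numeric check (the vectors are made up):

```python
# For φ(x) = (x1^2, √2·x1·x2, x2^2), the feature-space inner product equals
# the squared input-space inner product: φ(x)·φ(z) = (x·z)^2.
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-D inputs (illustrative)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

lhs = phi(x) @ phi(z)          # O(d^2) features computed explicitly
rhs = (x @ z) ** 2             # kernel evaluation, O(d) work
print(lhs, rhs)                # both equal 1.0 here: (1*2 + 3*(-1))^2 = 1
```

This shortcut, evaluating inner products in feature space without computing the features, is what the kernelized dual formulation below exploits.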

From Primal to Dual Formulation of SVM

• Recall that the SVM is defined by the constrained optimization problem above (the soft-margin objective).
• We can instead solve a dual optimization problem to obtain w
  ► We do not derive it here in detail. The basic idea is to form the Lagrangian, find w as a function of α (and the other variables), and express the Lagrangian only in terms of the dual variables.
From Primal to Dual Formulation of SVM

• Primal optimization problem: the soft-margin objective above
• Dual optimization problem:

  max_α Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ tᵢ tⱼ xᵢᵀxⱼ   s.t.   0 ≤ αᵢ ≤ γ,  Σᵢ αᵢ tᵢ = 0

• The weights become w = Σᵢ αᵢ tᵢ xᵢ, which is a function of the dual variables αᵢ ∀i ∈ N

From Primal to Dual Formulation of SVM

• In the dual optimization problem above, the weights become w = Σᵢ αᵢ tᵢ xᵢ
• The non-zero dual variables αᵢ correspond to observations that satisfy tᵢ(wᵀxᵢ + b) = 1 − ξᵢ. These are the support vectors.
• Observation: the input data only appear in the form of inner products xᵢᵀxⱼ
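To make the role of the dual variables concrete, here is a small check using scikit-learn (my choice, not part of the slides) that a fitted linear SVM is determined by its support vectors alone, via w = Σᵢ αᵢ tᵢ xᵢ:

```python
# scikit-learn's dual_coef_ stores α_i * t_i for the support vectors, so
# w = Σ α_i t_i x_i is recoverable from the support vectors alone.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(40, 2)),
               rng.normal(+2.0, 1.0, size=(40, 2))])
t = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, t)

alpha_t = clf.dual_coef_.ravel()            # α_i * t_i (support vectors only)
w = alpha_t @ clf.support_vectors_          # w = Σ α_i t_i x_i
b = clf.intercept_[0]

x_new = np.array([0.5, -0.3])
print(w @ x_new + b)                        # manual decision value
print(clf.decision_function([x_new])[0])    # matches scikit-learn
```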
SVM in Feature Space
From Inner Products to Kernels
Kernels
Kernelizing SVM
Example: Linear SVM

• Solid line: decision boundary. Dashed lines: the ±1 margins. Purple: Bayes-optimal boundary
• Solid dots: support vectors on the margin
Example: Degree-4 Polynomial Kernel SVM
Example: Gaussian Kernel SVM
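A sketch of how this kind of comparison can be reproduced; scikit-learn and the make_moons toy data are my choices, not the dataset behind the slides' figures:

```python
# Fit SVMs with the three kernels shown in these examples on synthetic 2-D data.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, t = make_moons(n_samples=200, noise=0.2, random_state=0)

models = {
    "linear": SVC(kernel="linear", C=1.0),
    "degree-4 polynomial": SVC(kernel="poly", degree=4, coef0=1.0, C=1.0),
    "gaussian (RBF)": SVC(kernel="rbf", gamma=1.0, C=1.0),
}
for name, clf in models.items():
    clf.fit(X, t)
    print(f"{name:20s} train accuracy = {clf.score(X, t):.2f}, "
          f"support vectors = {clf.n_support_.sum()}")
```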
