
Support Vector Machines

Dr.G.JOHN BABU

November 3, 2024



Classification Problem

(Figure slides illustrating the classification problem.)


Support Vector Machines

Support Vector Machines (SVM) are powerful supervised learning algorithms used for classification and regression.
Effective for complex datasets where traditional linear classifiers may fail.
Gained popularity due to robustness and versatility across various applications.



History and Context

Developed by Vladimir Vapnik in the early 1990s.
Originated from statistical learning theory.
Initially designed for binary classification by finding an optimal hyperplane.
The kernel trick, introduced in the late 1990s, expanded SVM’s capabilities to non-linear classification.



Principles of Support Vector Machines

Hyperplane: A subspace that divides the space into two half-spaces.
Margin: Distance between the hyperplane and the closest data points of each class.
Support Vectors: Critical data points that influence the position of the hyperplane.



Optimization Problem Formulation



Applications of Support Vector Machines

SVMs have numerous applications across fields:


Image Classification: For tasks such as face detection and object recognition.
Text Classification: Useful for spam detection, sentiment analysis.
Bioinformatics: Applied in gene expression and protein structure prediction.
Finance: Credit scoring, fraud detection.
Handwriting Recognition: Enhancing OCR accuracy.
Speech Recognition: Classifying spoken words or phrases.



Optimal Separation

Optimal separation aims to find the best hyperplane that maximizes the
margin between two classes, improving the generalization of the classifier.
A larger margin reduces the likelihood of misclassification for new
data points.
Support vectors are crucial in defining the optimal hyperplane.





Mathematical Formulation for Optimal Separation

\[
\min_{w,\,b}\ \frac{1}{2}\,\|w\|^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad \forall i
\]
where:
w and b define the hyperplane,
y_i denotes the class labels, ensuring that points of one class yield positive results and points of the other yield negative results.
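
To make this formulation concrete, here is a minimal sketch of the hard-margin primal handed to a generic convex solver. The tiny dataset, the variable names, and the use of the cvxpy library are illustrative assumptions, not part of the slides.

# Minimal sketch: hard-margin SVM primal with cvxpy (illustrative toy data).
import numpy as np
import cvxpy as cp

# Two linearly separable clusters; labels y_i in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, "b =", b.value)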



Lagrange Multipliers

Lagrange multipliers are a method in optimization to maximize or minimize a function under certain constraints.
Objective Function: The function to be optimized.
Constraints: The conditions that must be met.
Lagrangian Function:
\[
L(x, y, \lambda) = f(x, y) + \lambda\, g(x, y)
\]



Example with Lagrange Multipliers

Maximize f(x, y) = xy subject to g(x, y) = x + y - 10 = 0.
Lagrangian Function:
\[
L(x, y, \lambda) = xy + \lambda\,(10 - x - y)
\]



Partial Derivatives of the Lagrangian

Taking partial derivatives:
\[
\frac{\partial L}{\partial x} = y - \lambda = 0, \qquad
\frac{\partial L}{\partial y} = x - \lambda = 0, \qquad
\frac{\partial L}{\partial \lambda} = 10 - x - y = 0
\]
From these, we find x = 5, y = 5.
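
As a quick check of the arithmetic above, the same stationary-point equations can be solved symbolically. This is only a verification sketch; the use of sympy is an assumption for illustration.

# Verify the Lagrange-multiplier example: maximize xy subject to x + y = 10.
from sympy import symbols, diff, solve

x, y, lam = symbols('x y lam')
L = x * y + lam * (10 - x - y)      # Lagrangian L(x, y, lambda)

# Set all partial derivatives to zero and solve the resulting system.
stationary = solve([diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(stationary)                   # [{x: 5, y: 5, lam: 5}]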





Illustrative Example

To visualize the concept of optimal separation, consider a simple example with two classes represented by blue and red points in a two-dimensional space.
Suppose the data points can be separated by multiple lines. Some
lines may run closely against the data points, while others may run
further away. The line that runs equidistant from the nearest points
of both classes is the optimal hyperplane.
If we visualize the distance from the hyperplane to the nearest points
of each class, the goal is to maximize this distance. This ensures that
there is greater separation between classes, reducing the likelihood of
misclassification.
By selecting the hyperplane that maximizes the margin, the SVM is
more robust against noise and can handle small variations in new data
points effectively.



A Constrained Optimization Problem

We define a set of constraints under which the classifier makes correct predictions. By assigning the target values for the two classes as ±1 instead of 0 and 1, we can write down the product of the target t_i and the predicted output y_i. This product is positive if the predicted class matches the target, and negative otherwise.
Thus, we can formulate the classifier’s condition as:
\[
t_i\,(w^T x_i + b) \ge 1,
\]
ensuring correct classification.

The full optimization problem is then:
\[
\min_{w,\,b}\ \frac{1}{2}\, w^T w \quad \text{subject to} \quad t_i\,(w^T x_i + b) \ge 1, \quad \forall i = 1, \dots, n.
\]
This optimization problem involves minimizing the norm of the weight vector w, while ensuring that each datapoint satisfies the given constraint.

Quadratic Programming Solution

The problem is solved by quadratic programming: it is both quadratic (involving the square of the weight vector) and convex (the minimization problem has a unique solution).
The Karush–Kuhn–Tucker (KKT) conditions define the optimal solution as follows, for all values of i:
\[
\lambda_i^*\,\bigl(1 - t_i\,(w^{*T} x_i + b^*)\bigr) = 0,
\]
\[
1 - t_i\,(w^{*T} x_i + b^*) \le 0,
\]
\[
\lambda_i^* \ge 0.
\]
Here, the λ_i are Lagrange multipliers, which allow us to solve constrained optimization problems. The first condition implies that if λ_i ≠ 0, then t_i (w^{*T} x_i + b^*) = 1, meaning that the constraint holds as an equality for the support vectors. These support vectors lie on the boundary of the margin, and their constraints hold as equalities, reducing the number of datapoints that need to be considered.



Lagrangian Function

We define the Lagrangian for the problem as:
\[
L(w, b, \lambda) = \frac{1}{2}\, w^T w + \sum_{i=1}^{n} \lambda_i\,\bigl(1 - t_i\,(w^T x_i + b)\bigr).
\]
Differentiating this with respect to w and b, we obtain:
\[
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} \lambda_i t_i x_i,
\qquad
\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \lambda_i t_i.
\]



Setting these derivatives equal to zero gives us the optimal values for w and b:
\[
w = \sum_{i=1}^{n} \lambda_i t_i x_i, \qquad \sum_{i=1}^{n} \lambda_i t_i = 0.
\]
Substituting these values into the Lagrangian function yields the dual problem, where we aim to maximize the following with respect to the λ_i:
\[
L(w^*, b^*, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j t_i t_j\, x_i^T x_j,
\]
subject to λ_i ≥ 0 and \(\sum_{i=1}^{n} \lambda_i t_i = 0\).
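
For illustration, the dual can also be handed directly to a convex solver. The sketch below assumes a tiny toy dataset and the cvxpy library (neither is from the slides), and rewrites the quadratic term as one half the squared norm of the vector Σ_i λ_i t_i x_i so the problem stays in standard convex form.

# Minimal sketch: solve the SVM dual for a tiny linearly separable dataset.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
n = len(t)

lam = cp.Variable(n, nonneg=True)                       # lambda_i >= 0

# sum_i lambda_i  -  (1/2) || sum_i lambda_i t_i x_i ||^2
objective = cp.Maximize(cp.sum(lam)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(lam, t)))
constraints = [cp.sum(cp.multiply(lam, t)) == 0]        # sum_i lambda_i t_i = 0
cp.Problem(objective, constraints).solve()

w = X.T @ (lam.value * t)                               # w = sum_i lambda_i t_i x_i
print("lambda =", lam.value, "w =", w)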



Slack Variables for Non-Linearly Separable Problems

In the case where the dataset is not linearly separable, we introduce slack variables η_i ≥ 0 to relax the constraints:
\[
t_i\,(w^T x_i + b) \ge 1 - \eta_i.
\]
Here, η_i = 0 for correctly classified points, and η_i > 0 for misclassified points.
The objective function now becomes:
\[
L(w, \eta) = w^T w + C \sum_{i=1}^{n} \eta_i,
\]
where C is a parameter that balances the trade-off between minimizing the classification error and maximizing the margin. This transforms the classifier into a soft-margin classifier.
The KKT conditions for this problem are:
\[
\lambda_i^*\,\bigl(1 - t_i\,(w^{*T} x_i + b^*) - \eta_i\bigr) = 0,
\qquad
(C - \lambda_i^*)\,\eta_i = 0,
\qquad
\sum_{i=1}^{n} \lambda_i t_i = 0.
\]
Finally, we compute the optimal bias b* by averaging over the support vectors.
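
In practice the soft-margin problem is usually solved with a library rather than by hand. A minimal sketch with scikit-learn's SVC follows, where the C argument plays exactly the trade-off role described above; the toy data are an assumption for illustration.

# Minimal sketch: soft-margin (C-controlled) linear SVM with scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=2.0, size=(20, 2)),
               rng.normal(loc=-2.0, size=(20, 2))])
t = np.array([1] * 20 + [-1] * 20)

# Small C tolerates more margin violations (larger slack); large C penalizes them.
clf = SVC(kernel='linear', C=1.0).fit(X, t)
print("number of support vectors:", len(clf.support_vectors_))
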
Prediction for a New Data Point

For a new point z, the prediction can be made using:
\[
w^{*T} z + b^* = \sum_{i=1}^{n} \lambda_i t_i\, x_i^T z + b^*.
\]
Thus, classification of a new point involves computing the inner product between the point and the support vectors.
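
This is also how a fitted model predicts in practice: only the support vectors and their λ_i t_i weights are needed. The sketch below assumes the scikit-learn model clf fitted in the previous sketch (an assumption, since the slides name no library) and reproduces the decision value for a new point z from those stored pieces.

# Sketch: reproduce the SVM decision value from the support vectors only.
import numpy as np

z = np.array([0.5, 1.5])

# scikit-learn stores lambda_i * t_i for each support vector in dual_coef_.
manual = clf.dual_coef_ @ (clf.support_vectors_ @ z) + clf.intercept_
print(manual, clf.decision_function(z.reshape(1, -1)))   # the two values agree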



Kernel Trick in SVM

The kernel trick in SVM is used to handle data that is not linearly
separable by transforming it into a higher-dimensional space.
Instead of computing this transformation directly, the kernel trick
allows us to compute the inner product of transformed vectors in the
original space.
This reduces computational complexity, making algorithms efficient
even with complex mappings.



Need for a Kernel in SVM

When we cannot linearly separate data in the original feature space, modifying the features can help. This idea is similar to the XOR problem we encountered earlier. By transforming the data into a higher-dimensional space, we might find a linear decision boundary that separates the classes. To achieve this, we introduce new functions ϕ(x) based on the input features.
The key idea is to transform the input x_i into a new form ϕ(x_i), while still being able to use the SVM algorithm. Specifically, the prediction equation remains valid, but with x_i replaced by ϕ(x_i). The resulting prediction equation becomes:
\[
w^T \phi(z) + b = \sum_{i=1}^{n} \lambda_i t_i\, \phi(x_i)^T \phi(z) + b.
\]
The choice of functions ϕ(x) is critical.



How the Kernel Trick Works

1. Problem with Non-Linear Data: In SVM, we want to classify data with a separating hyperplane, but non-linear data cannot be separated in the original space.
2. Mapping to a Higher-Dimensional Space: A transformation function ϕ(x) maps each data point to a higher-dimensional space, making separation possible. However, directly computing ϕ(x) is computationally expensive.
3. Role of the Kernel Trick: The kernel function K(x, y) = ϕ(x) · ϕ(y) allows us to compute similarity between data points in the original space, avoiding direct computation in higher dimensions (see the sketch after this list).
4. Computational Advantage: This avoids high-dimensional transformation calculations, reducing computational load.
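
As referenced in point 3, here is a small numerical check, included only as an illustration, that the kernel value computed in the original space equals the explicit inner product after mapping, using K(x, y) = (x · y)^2 with the feature map ϕ(x) = (x_1^2, √2 x_1 x_2, x_2^2).

# Check that K(x, y) = (x . y)^2 equals phi(x) . phi(y) for an explicit phi.
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for 2-D input.
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print((x @ y) ** 2)          # kernel computed in the original 2-D space
print(phi(x) @ phi(y))       # same value via the 3-D feature map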



Polynomial Kernel

The polynomial kernel function maps input data into a polynomial feature space:
\[
K(x, y) = (x \cdot y + c)^d
\]
where:
d is the degree of the polynomial,
c is a constant controlling the influence of higher-dimensional features.
By applying a polynomial kernel, we can capture interactions of features up to the d-th degree, helping classify data that has non-linear relationships.
Example: A polynomial kernel of degree 2 can separate data that requires a quadratic boundary.
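
A minimal usage sketch with scikit-learn's polynomial kernel follows; in that library the degree d and the constant c correspond to the degree and coef0 parameters, and the circle-shaped toy data are an illustrative assumption.

# Sketch: degree-2 polynomial kernel for data needing a quadratic boundary.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, t = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# scikit-learn's polynomial kernel is (gamma * x.y + coef0)^degree.
clf = SVC(kernel='poly', degree=2, coef0=1.0, gamma='scale').fit(X, t)
print("training accuracy:", clf.score(X, t))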



Sigmoid Kernel in SVM

The sigmoid kernel function is defined as:
\[
K(x, y) = \tanh\bigl(\alpha\,(x \cdot y) + c\bigr)
\]
where α and c control the shape of the decision boundary.
Inspired by neural networks, the sigmoid kernel behaves similarly to neuron activation functions.
Example: With appropriate parameters, the sigmoid kernel maps data to a curved decision boundary, capturing non-linear relationships.
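
A short sketch of the same kernel, once by hand and once through scikit-learn, whose gamma and coef0 parameters play the roles of α and c here; the specific values are assumptions for illustration.

# Sketch: sigmoid kernel value by hand and the corresponding SVC setup.
import numpy as np
from sklearn.svm import SVC

alpha, c = 0.5, 1.0
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(np.tanh(alpha * (x @ y) + c))                 # K(x, y) = tanh(alpha (x.y) + c)

clf = SVC(kernel='sigmoid', gamma=alpha, coef0=c)   # same kernel used inside SVC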



Radial Basis Function (RBF) Kernel in SVM

The RBF kernel (or Gaussian kernel) is widely used for non-linear classification:
\[
K(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)
\]
where σ determines the spread of the kernel.
It measures the "distance" between points, with closer points having higher similarity.
Example: For data forming concentric circles, the RBF kernel allows SVM to classify these clusters by mapping them into separable regions in the transformed space.
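
A brief sketch of the concentric-circles example follows. Note that scikit-learn parametrizes the RBF kernel with gamma = 1/(2σ^2), so the assumed σ is converted accordingly.

# Sketch: RBF-kernel SVM on concentric circles (non-linear clusters).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, t = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

sigma = 1.0
clf = SVC(kernel='rbf', gamma=1.0 / (2 * sigma**2)).fit(X, t)   # gamma = 1/(2 sigma^2)
print("training accuracy:", clf.score(X, t))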



Mercer’s Theorem

Mercer’s theorem is fundamental in validating the use of kernels in SVM.
It states that a function K(x, y) is a valid kernel if it corresponds to an inner product in some higher-dimensional space.
This means that if K(x, y) is positive semi-definite, it can be used as a kernel in SVM.
Implication: Mercer’s theorem ensures that we can apply the kernel trick with confidence, knowing that K(x, y) represents a genuine inner product.
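
One practical way to spot-check Mercer's condition on a finite sample is to confirm that the Gram matrix has no significantly negative eigenvalues. The sketch below does this for the RBF kernel on random points; the data, gamma value, and round-off tolerance are assumptions.

# Sketch: check that an RBF Gram matrix on sample points is positive semi-definite.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

K = rbf_kernel(X, X, gamma=0.5)               # Gram matrix K_ij = K(x_i, x_j)
eigvals = np.linalg.eigvalsh(K)               # symmetric matrix, so use eigvalsh
print("smallest eigenvalue:", eigvals.min())  # non-negative up to round-off error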



Summary of Kernel Trick

The kernel trick allows SVM to compute similarity directly in the input space, reducing computational complexity.
Key Kernels:
Polynomial Kernel: Maps data into polynomial feature space.
Sigmoid Kernel: Inspired by neural networks, creates curved boundaries.
RBF Kernel: Suitable for data with clusters or curved boundaries.
The choice of kernel function depends on the data structure and the separation required for effective classification.

