
SUPPORT VECTOR MACHINE
Prof. Subodh Kumar Mohanty
The idea of support vectors and their importance
Introduction
• The support vector machine is currently considered one of the best off-the-shelf learning algorithms and has been applied successfully in various domains.
• Support vector machines were originally designed for binary classification.
• They were later extended to solve multi-class and regression problems.
• However, they remain most widely used for classification tasks.
• The objective of the support vector machine algorithm is to find a hyperplane
in an N-dimensional space (N — the number of features) that distinctly
classifies the data points.
• To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between the data points of both classes.

Possible hyperplanes
• Maximizing the margin distance provides some reinforcement so that
future data points can be classified with more confidence.

• The goal is to choose a hyperplane with the greatest possible margin
between the hyperplane and any point within the training set, giving a
greater chance of new data being classified correctly.
• Hyperplanes are decision boundaries that help classify the data points.
• Data points falling on either side of the hyperplane can be attributed to
different classes.
• Also, the dimension of the hyperplane depends upon the number of
features.
• It becomes difficult to visualize the hyperplane when the number of features exceeds 3.
• Support vectors are the data points that lie closest to the hyperplane and influence its position and orientation.
• Using these support vectors, we maximize the margin of the classifier.
• Deleting the support vectors will change the position of the hyperplane.
• These are the points that help us build our SVM (that works for a
new/test data).

Two test data points on either side of the hyperplane: one will be predicted as square, the other as circle.
• But what happens when there is no clear hyperplane?
• A dataset will often look more like the jumbled balls below, which represent a linearly non-separable dataset.

• In order to classify a dataset like the one above, it is necessary to move away from a 2D view of the data to a 3D view.
• Explaining this is easiest with another simplified example.
• Imagine that our two sets of colored balls above are sitting on a sheet
and this sheet is lifted suddenly, launching the balls into the air.
• While the balls are up in the air, you use the sheet to separate them.
• This ‘lifting’ of the balls represents the mapping of data into a higher
dimension.

• This is known as kernelling.
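To make the "lifting" idea concrete, here is a standard illustrative feature map (not taken from these slides, but the textbook example for this situation): 2D points that can only be separated by a circle become linearly separable after a quadratic mapping.

φ(x_1, x_2) = (x_1^2, √2 x_1 x_2, x_2^2)

A circular boundary x_1^2 + x_2^2 = r^2 in the input space becomes the plane z_1 + z_3 = r^2 in the new (z_1, z_2, z_3) space, which is linear. Moreover, φ(x)^T φ(z) = (x^T z)^2, so the inner product in the lifted space can be computed directly from the original 2D vectors; this observation is exactly what kernelling exploits.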


Pros & Cons of Support Vector Machines
Pros
• High accuracy
• Works well on smaller, cleaner datasets
• It can be more efficient because it uses only a subset of the training points (the support vectors)

Cons
• Isn’t suited to larger datasets as the training time with SVMs can be high
• Less effective on noisier datasets with overlapping classes
Applications
• SVM is used for text classification tasks such as category assignment,
detecting spam and sentiment analysis.
• It is also commonly used for image recognition challenges,
performing particularly well in aspect-based recognition and
color-based classification.
• SVM also plays a vital role in many areas of handwritten digit
recognition, such as postal automation services.
Derivation of Support Vector Equation
Comparison with logistic regression

Training set: {(x^(1), y^(1)), ..., (x^(m), y^(m))} with m examples, y ∈ {0, 1}

Hypothesis (sigmoid function): h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^(-z))

Threshold classifier output at 0.5:
If h_θ(x) ≥ 0.5 (i.e. θ^T x ≥ 0), predict "y = 1"
If h_θ(x) < 0.5 (i.e. θ^T x < 0), predict "y = 0"

How to choose parameters θ? Maximum Likelihood Estimation (already discussed)
Comparison with logistic regression
• In SVM, we take the output of the linear function and if that output is greater than 1, we identify it with one class, and if the output is less than -1, we identify it with the other class:
If w^T x + w_0 ≥ 1, predict "y = +1"
If w^T x + w_0 ≤ -1, predict "y = -1"
• Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values ([-1, 1]) which acts as the margin.

• g(x) = w^T x + w_0 is a linear discriminant function that divides (categorizes) the input space into two decision regions.

• The generalization of the linear discriminant function to an n-dimensional feature space is straightforward:

g(x) = w_0 + w_1 x_1 + ... + w_n x_n = w^T x + w_0

• The decision surface g(x) = 0 is now a linear surface in the n-dimensional space, called a hyperplane, symbolized as w^T x + w_0 = 0.
• A two-category classifier implements the following decision rule:
Decide Class 1 if g(x) > 0 and Class 2 if g(x) < 0
• Thus, x is assigned to Class 1 if the inner product w^T x exceeds the threshold (bias) -w_0, and to Class 2 otherwise.
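As a small numerical illustration (numbers chosen here, not from the slides): with w = (2, 1)^T and w_0 = -4, the point x = (3, 1)^T gives g(x) = 2·3 + 1·1 - 4 = 3 > 0, so it is assigned to Class 1, while x = (1, 1)^T gives g(x) = 2 + 1 - 4 = -1 < 0, so it is assigned to Class 2.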
• Figure shows the architecture of a typical implementation of the
linear classifier.
• It consists of two computational units: an aggregation unit and an
output unit.

A simple linear classifier


Linear decision boundary between two classes

Algebraic measure of the distance from x to the hyperplane (geometry shown for 3 dimensions, n = 3)
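The result the figure illustrates, stated as the standard formula (the equation image itself is not preserved in this text): the algebraic (signed) distance r from a point x to the hyperplane w^T x + w_0 = 0 is

r = g(x) / ||w|| = (w^T x + w_0) / ||w||

so r > 0 on the Class 1 side and r < 0 on the Class 2 side, and the distance of the hyperplane from the origin is |w_0| / ||w||.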
Linear Maximal Margin Classifier for Linearly
Separable Data
• For linearly separable data, many hyperplanes exist that perform the separation.
• The SVM framework tells us which hyperplane is best: the hyperplane with the largest margin, which minimizes the training error.
• Select the decision boundary that is far away from both the classes.
• Large margin separation is expected to yield good generalization.
• In w^T x + w_0 = 0, w defines a direction perpendicular to the hyperplane.
• w is called the normal vector (or simply normal) of the hyperplane.
• Without changing the normal vector w, varying w0 moves the
hyperplane parallel to itself.

Large margin and small margin separation


Geometric interpretation of algebraic distances of points to a hyperplane, for the two-dimensional case

KKT Condition
Learning problem in SVM

Hard margin SVM vs. soft margin SVM
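The equations for the learning problem are not preserved in this text; for reference, the standard hard-margin formulation for linearly separable data {(x_i, y_i)}, y_i ∈ {-1, +1}, i = 1, ..., m, is:

minimize over (w, w_0):   (1/2) ||w||^2
subject to:   y_i (w^T x_i + w_0) ≥ 1,   i = 1, ..., m

Maximizing the margin 2/||w|| is equivalent to minimizing ||w||^2 / 2, and the KKT conditions of this convex problem identify the support vectors as the points whose constraints are active, i.e. y_i (w^T x_i + w_0) = 1. The soft-margin variant, covered next, relaxes these constraints.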



Linear Soft Margin Classifier for Overlapping
Classes
• To generalize the SVM, we must allow for noise (overlapping classes) in the training data.
• The hard margin linear SVM algorithm will not work in this case.
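The soft-margin objective is likewise not preserved in this text; the standard formulation introduces slack variables ξ_i ≥ 0 and a penalty parameter C > 0:

minimize over (w, w_0, ξ):   (1/2) ||w||^2 + C Σ_{i=1}^{m} ξ_i
subject to:   y_i (w^T x_i + w_0) ≥ 1 - ξ_i,   ξ_i ≥ 0,   i = 1, ..., m

A point with 0 < ξ_i ≤ 1 lies inside the margin but on the correct side of the boundary, while ξ_i > 1 means the point is misclassified; larger C penalizes violations more heavily and pushes the solution towards the hard-margin case.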

Soft decision boundary






Kernel Function: Dealing with Nonlinearity
Non-linear classifiers
• For several real-life datasets, the decision boundaries are nonlinear.
• To deal with the nonlinear case, the formulation and solution methods employed for the linear case are still applicable.
• Only the input data is transformed from its original space into another (higher-dimensional) space, so that a linear decision boundary can separate Class 1 examples from Class 2 examples.
• The transformed space is called the feature space.
• The original data space is known as the input space.
Non-linear classifiers
• Some training examples cannot be linearly separated in the input space.
• In the feature space, obtained through a suitable transformation, they can be separated linearly.

Transformation from input space to feature space
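The defining equations of the kernel approach are not shown in this text; the standard statement is that a kernel function computes the inner product of two points after the mapping φ into the feature space, without ever computing φ explicitly:

K(x, z) = φ(x)^T φ(z)

Because the SVM optimization problem and its decision function depend on the data only through inner products, every occurrence of x_i^T x_j can be replaced by K(x_i, x_j). Mercer's theorem, named in the heading below, gives the condition under which a symmetric function K is a valid kernel, i.e. corresponds to an inner product in some feature space.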





Mercer’s theorem

Polynomial and Radial Basis Kernel

Polynomial Kernel
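The kernel's formula is missing from this text; the standard polynomial kernel of degree d with constant offset c ≥ 0 is:

K(x, z) = (x^T z + c)^d

With c = 0 the kernel is called homogeneous; for d = 2 and c = 0, K(x, z) = (x^T z)^2 corresponds to the explicit quadratic feature map shown earlier.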
• The polynomial kernel represents the similarity of vectors (training
samples) in a feature space over polynomials of the original variables,
allowing learning of non-linear models.
• It looks not only at the given features of input samples to determine their
similarity, but also combinations of these (interaction features).
• Quite popular in natural language processing (NLP).
• The most common degree is d = 2 (quadratic), since larger degrees tend to
overfit on NLP problems.
• One problem with the polynomial kernel is that it may suffer from numerical instability, since the kernel value can range from 0 to infinity depending on the magnitude of x^T z + c and the degree d.
Radial Basis Kernel
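The formula itself is not preserved here; the standard RBF (Gaussian) kernel, written with d₁₂ = ||X₁ - X₂|| for the distance referred to in the bullets below, is:

K(X₁, X₂) = exp(-d₁₂^2 / (2σ^2)) = exp(-γ ||X₁ - X₂||^2),   where γ = 1 / (2σ^2) > 0

The width parameter σ (equivalently γ) controls how quickly the similarity decays with distance.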
• The maximum value that the RBF kernel can take is 1; it occurs when d₁₂ is 0, i.e. when the points are the same, X₁ = X₂.
• When the points are the same, there is no distance between them and therefore they are extremely similar.
• When the points are separated by a large distance, the kernel value is less than 1 and close to 0, which means the points are dissimilar.
• There are no golden rules for determining which admissible kernel will
result in the most accurate SVM.
• In practice, the kernel chosen does not generally make a large difference in
resulting accuracy.
• SVM training always finds a global solution, unlike neural networks (to be
discussed in the next chapter) where many local minima usually exist.
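As a practical aside (not part of the original slides), these kernels are easy to compare empirically with scikit-learn, assuming it is installed; the snippet below is only a minimal sketch on a toy, linearly non-separable dataset.

# Minimal sketch (assumes scikit-learn is available): compare three SVM kernels.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A small linearly non-separable dataset, like the "jumbled balls" example.
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    # C is the soft-margin penalty; degree is used only by the polynomial kernel.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, degree=2))
    clf.fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))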
