
Unit-3 SUPPORT VECTOR MACHINES

Dr. John Babu

Support Vector Machines


Introduction
Support Vector Machines (SVM) are powerful supervised learning algorithms used for classification
and regression tasks. They are particularly well-suited for tasks involving complex datasets where
traditional linear classifiers may fail to deliver satisfactory results. SVMs have become one of the
most popular methods in machine learning due to their effectiveness and robustness in various
applications.

History and Context


The SVM algorithm was developed by Vladimir Vapnik and his colleagues in the early 1990s. The
roots of SVM can be traced back to statistical learning theory, to which Vapnik contributed significantly. The original version of the SVM was designed to solve binary classification problems
by finding the optimal hyperplane that separates two classes of data points in a high-dimensional
space.
The introduction of the kernel trick in the late 1990s expanded the applicability of SVMs by
allowing them to handle non-linear classification problems. By using kernel functions, SVMs can
effectively map the input data into higher dimensions, enabling the algorithm to find a hyperplane
that can separate classes that are not linearly separable in the original input space.

Principle of Support Vector Machines


The main idea behind SVM is to find the optimal hyperplane that maximizes the margin between
two classes. Here are the key concepts:
Hyperplane: In an n-dimensional space, a hyperplane is a flat affine subspace of dimension
n-1. For example, in a 2D space, a hyperplane is a line, and in 3D, it is a plane.
Margin: The margin is defined as the distance between the hyperplane and the closest data
points from either class. The goal of SVM is to maximize this margin, ensuring that the hyperplane
is as far away from the nearest points of both classes as possible.
Support Vectors: These are the data points that lie closest to the hyperplane and are critical
for defining the position and orientation of the hyperplane. Only the support vectors influence
the location of the hyperplane, while the other data points do not affect it.
The SVM algorithm can be summarized in the following steps:
1. Transform the Data: If the data is not linearly separable, kernel functions can be used
to transform the data into a higher-dimensional space.
2. Find the Optimal Hyperplane: The SVM algorithm finds the hyperplane that maximizes the margin by solving a quadratic optimization problem.
3. Classification: Once the hyperplane is determined, new data points can be classified
based on which side of the hyperplane they fall on.
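As a concrete illustration of these three steps, here is a minimal sketch using scikit-learn (an assumed library choice; the dataset and parameters are illustrative). The kernel argument handles the implicit transformation, fit solves the quadratic optimization, and predict assigns new points to a side of the hyperplane.

# Minimal SVM workflow sketch (scikit-learn assumed; toy data stands in for any labelled dataset).
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: choose a kernel; the data transformation is done implicitly.
clf = SVC(kernel="rbf", C=1.0)

# Step 2: fitting solves the quadratic optimization for the optimal hyperplane.
clf.fit(X_train, y_train)

# Step 3: classify new points by which side of the hyperplane they fall on.
print(clf.predict(X_test[:5]))
print("accuracy:", clf.score(X_test, y_test))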

Applications of Support Vector Machines
SVMs have a wide range of applications across various fields due to their robustness and accuracy.
Some notable applications include:
- Image Classification: SVMs are widely used for classifying images in computer vision tasks,
such as face detection and object recognition.
- Text Classification: SVMs are effective in categorizing text documents, such as spam detection
in emails or sentiment analysis.
- Bioinformatics: SVMs are applied in gene expression classification and protein structure
prediction, helping in medical diagnosis and treatment planning.
- Finance: In finance, SVMs are used for credit scoring, risk assessment, and fraud detection
by analyzing historical transaction data.
- Handwriting Recognition: SVMs have been employed in recognizing handwritten characters
and digits, enhancing the accuracy of Optical Character Recognition (OCR) systems.
- Speech Recognition: SVMs are also used in speech recognition systems to classify spoken
words or phrases.
Support Vector Machines are a versatile and powerful tool in the machine learning toolkit,
known for their ability to handle both linear and non-linear classification problems effectively.
Their theoretical foundation and practical applications make them a key algorithm for researchers
and practitioners in various domains.

Optimal Separation
In the context of Support Vector Machines (SVM), optimal separation refers to the process of finding
the best hyperplane that divides two classes of data in such a way that maximizes the margin
between them. The margin is defined as the distance between the hyperplane and the nearest
data points from each class, which are known as support vectors.
The goal of optimal separation is to create a decision boundary that not only separates the
two classes but does so in a manner that minimizes the chance of misclassification when new data
points are introduced. The optimization problem can be expressed mathematically as:

minimize (1/2) ∥w∥²
subject to ti (w · xi + b) ≥ 1, ∀i
where w is the weight vector, b is the bias term, ti is the class label for data point i, and xi is the
feature vector of data point i.

Principles of Optimal Separation


1. Hyperplane: In a multidimensional space, a hyperplane is a flat affine subspace that separates
the space into two half spaces. In two dimensions, a hyperplane is simply a line, while in three
dimensions, it is a plane.
2. Margin: The margin is the gap between the hyperplane and the nearest points from either
class. The larger the margin, the better the separation. This is crucial for ensuring that the
classifier can generalize well to new, unseen data.
3. Support Vectors: These are the data points that lie closest to the hyperplane. The position
of the hyperplane is determined solely by these points. Other points do not affect its location.
The support vectors are essential in defining the optimal hyperplane.



Mathematical Formulation
To find the optimal hyperplane, we aim to solve the following optimization problem:

minimize (1/2) ∥w∥²
subject to ti (w · xi + b) ≥ 1, ∀i
In this formulation:
- w is the weight vector that defines the orientation of the hyperplane.
- b is the bias term that shifts the hyperplane away from the origin.
- ti is the class label for the training data points xi .

The condition ensures that points of one class yield positive results while points of the other
class yield negative results when plugged into the equation.

Solution By Quadratic Programming


Quadratic programming (QP) is a type of optimization problem in which the objective function is quadratic (it includes squared terms, such as x² or (1/2) ∥w∥²) and the constraints are linear.
QP is widely used in constrained optimization because it effectively handles situations where the
cost or performance metric is not just linearly related to variables but has a certain curvature,
reflecting diminishing returns, risk minimization, or other nonlinear behaviors in economic and
engineering applications.

Lagrange Multipliers
Lagrange multipliers are a method used in optimization to find the local maxima and minima of
a function subject to equality constraints. This technique is particularly useful when we need to
optimize a function while ensuring that certain conditions (constraints) are met.

Basic Concept of Lagrange Multipliers


- Objective Function: This is the function we want to maximize or minimize, denoted as f(x, y).

- Constraints: These are the conditions that must be satisfied, denoted as g(x, y) = 0.

- Lagrangian Function: The Lagrangian incorporates both the objective function and the constraint. It is defined as:

  L(x, y, λ) = f(x, y) + λ g(x, y)

  Here, λ is the Lagrange multiplier.

Steps to Use Lagrange Multipliers


1. Set up the Lagrangian.

2. Take partial derivatives of the Lagrangian with respect to each variable (including the multiplier).



3. Set the derivatives equal to zero to find critical points.
4. Solve the resulting system of equations to find the values of the variables and the multiplier.

Simple Example
Let’s illustrate this with a basic example.

Example Problem
Maximize the function f (x, y) = xy subject to the constraint g(x, y) = x + y − 10 = 0.

Step 1: Set up the Lagrangian


The Lagrangian function for this problem is:
L(x, y, λ) = xy + λ(10 − x − y)

Step 2: Take Partial Derivatives


Now, we take partial derivatives of the Lagrangian:
1. With respect to x:
   ∂L/∂x = y − λ = 0    (1)
2. With respect to y:
   ∂L/∂y = x − λ = 0    (2)
3. With respect to λ:
   ∂L/∂λ = 10 − x − y = 0    (3)

Step 3: Set Derivatives Equal to Zero


From equations (1) and (2), we can express λ:
- From (1): λ = y
- From (2): λ = x
Setting the two equal gives:
y = x    (4)

Step 4: Solve the Constraint


Substituting equation (4) into the constraint (3):
10 − x − y = 0  ⟹  10 − x − x = 0  ⟹  10 − 2x = 0  ⟹  2x = 10  ⟹  x = 5
Since y = x, we also have:
y = 5

Conclusion: Optimal Solution


The values that maximize the function f (x, y) = xy under the given constraint are:
x = 5, y=5
Maximum Value:
f (5, 5) = 5 × 5 = 25
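The same stationary conditions can be solved symbolically; a minimal sketch with SymPy (an assumed dependency, not part of the original notes) is shown below.

# Verifying the worked Lagrange-multiplier example with SymPy (assumed available).
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
L = x * y + lam * (10 - x - y)           # Lagrangian L(x, y, lambda)

# Stationarity: set all partial derivatives to zero and solve the system.
eqs = [sp.diff(L, v) for v in (x, y, lam)]
solution = sp.solve(eqs, (x, y, lam), dict=True)
print(solution)                           # [{x: 5, y: 5, lam: 5}]
print((x * y).subs(solution[0]))          # 25, the maximum of xy on x + y = 10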



Summary
In this example, we used Lagrange multipliers to find the maximum value of the function xy under
the constraint that x + y = 10. The method allowed us to transform a constrained optimization
problem into a system of equations that could be solved easily. This technique is powerful for more
complex optimization problems encountered in various fields, including economics, engineering,
and machine learning.

A Constrained Optimization Problem


In order to determine the effectiveness of a classifier, we can establish that fewer classification
errors indicate a better model. To mathematically express this, we define a set of constraints
where the classifier should make correct predictions. By assigning the target values for two classes
as ±1 instead of 0 and 1, we can write down the product of the target ti and the predicted
output yi . This product will be positive if the predicted class matches the target, and negative
otherwise. Therefore, we can formulate the classifier’s condition as ti (wT xi + b) ≥ 1, ensuring
correct classification.
The full optimization problem is then:

min (1/2) wT w   subject to   ti (wT xi + b) ≥ 1,   ∀i = 1, . . . , n.
This optimization problem involves minimizing the norm of the weight vector w, while ensuring
that each datapoint satisfies the given constraint.

Primal Problem in SVM


The Primal Problem is the initial formulation of the optimization problem in SVM. Its objective
is to find a hyperplane that maximally separates data from two classes.
The hyperplane is defined as:
f (x) = wT x + b
where w is the weight vector, and b is the bias term.
For linearly separable data, the SVM aims to:
- Maximize the margin (the distance between the separating hyperplane and the closest points from each class).
- Minimize classification error.

The primal problem is set up as follows:

Objective Function:

min_{w,b} (1/2) ∥w∥²

where ∥w∥² = wT w (the squared norm of the weight vector).

Constraints: For all data points (xi , yi ) where yi is the label (+1 or -1):

yi (wT xi + b) ≥ 1

These constraints ensure that each point is correctly classified and at least 1 unit away from
the decision boundary.



Quadratic programming efficiently solves problems like this in polynomial time. The advantage
is that convex problems, like this one, have a unique minimum. The Karush–Kuhn–Tucker
(KKT) conditions define the optimal solution as follows for all values of i:

λi (1 − ti (wT xi + b)) = 0
1 − ti (wT xi + b) ≤ 0
λi ≥ 0
Here, λi are Lagrange multipliers, which allow us to solve constrained optimization problems.
The first condition implies that if λi ≠ 0, then ti (w*T xi + b*) = 1, meaning that the constraint holds as an equality for support vectors. These support vectors lie on the boundary of the margin, and their constraints hold as equalities, reducing the number of datapoints that need to be considered.

Lagrangian Function
We define the Lagrangian for the problem as:

L(w, b, λ) = (1/2) wT w + Σ_{i=1}^{n} λi (1 − ti (wT xi + b)).

Differentiating this with respect to w and b, we obtain:

∂L/∂w = w − Σ_{i=1}^{n} λi ti xi,

∂L/∂b = − Σ_{i=1}^{n} λi ti.

Setting these derivatives equal to zero gives us the optimal values for w and b:

w = Σ_{i=1}^{n} λi ti xi

Σ_{i=1}^{n} λi ti = 0.

Substituting these values into the Lagrangian function yields the dual problem, where we aim to maximize the following with respect to λi:

L(w*, b*, λ) = Σ_{i=1}^{n} λi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λi λj ti tj xiT xj,

subject to λi ≥ 0 and Σ_{i=1}^{n} λi ti = 0.
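To make the dual concrete, the sketch below solves it numerically for a tiny, linearly separable toy dataset using scipy.optimize.minimize (an assumption; a dedicated quadratic programming solver would normally be used).

# Solving the hard-margin dual numerically on toy data (scipy and numpy assumed).
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T                                          # Gram matrix of inner products xi^T xj

def neg_dual(lam):
    # Negative of the dual objective, since we minimize instead of maximize.
    return -(lam.sum() - 0.5 * (lam * t) @ K @ (lam * t))

cons = {"type": "eq", "fun": lambda lam: lam @ t}    # sum_i lambda_i t_i = 0
bounds = [(0, None)] * len(t)                        # lambda_i >= 0
res = minimize(neg_dual, x0=np.zeros(len(t)), bounds=bounds, constraints=[cons])

lam = res.x
w = (lam * t) @ X                                    # w = sum_i lambda_i t_i x_i
print("lambda:", np.round(lam, 3), "w:", np.round(w, 3))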

Derivation of b for SVM


To understand how the equation for b is formed, let’s break down the process step-by-step. This
derivation is performed in the context of Support Vector Machines (SVM), where we aim to find
the optimal separation boundary between two classes.



Support Vector Condition: For any support vector xj, the optimal separation condition is met exactly:

tj (wT xj + b) = 1

where:
- tj is the label of the support vector (either +1 or −1).
- wT xj + b represents the linear boundary applied to the support vector xj.
Expressing w Using Lagrange Multipliers: In terms of the support vectors, we can write w as:

w = Σ_{i=1}^{n} λi ti xi

where:
- λi are the Lagrange multipliers.
- xi are the training data points.
- Only points with λi > 0 are support vectors and contribute to w.
Substitute w in the Support Vector Condition: Substitute w into the support vector condition:

tj ( (Σ_{i=1}^{n} λi ti xi)T xj + b ) = 1

Expanding this gives:

tj ( Σ_{i=1}^{n} λi ti xiT xj + b ) = 1

Since tj is +1 or −1, dividing both sides by tj (equivalently, multiplying by tj) isolates b:

Σ_{i=1}^{n} λi ti xiT xj + b = tj

Rearranging, we obtain:

b = tj − Σ_{i=1}^{n} λi ti xiT xj
Averaging Over All Support Vectors: Since any support vector xj should satisfy this equation, we compute b for each support vector and then average to obtain a stable value. With Ns support vectors, this becomes:

b = (1/Ns) Σ_{support vectors j} ( tj − Σ_{i=1}^{n} λi ti xiT xj )

This averaging approach gives a more robust calculation of b across all support vectors, rather than relying on a single point.
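Continuing the numerical sketch above (reusing the hypothetical X, t, lam, and w computed there), b can be obtained by averaging over the support vectors exactly as in the formula:

# Averaging b over the support vectors (continuation of the earlier dual sketch).
import numpy as np

support = lam > 1e-6                         # points with lambda_i > 0 are support vectors
b_values = t[support] - (X[support] @ w)     # t_j - sum_i lambda_i t_i x_i^T x_j for each SV
b = b_values.mean()                          # average over the N_s support vectors
print("b =", b)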

Prediction for a New Data Point


For a new point z, the prediction can be made using:

wT z + b = Σ_{i=1}^{n} λi ti xiT z + b.

Thus, classification of a new point involves computing the inner product between the point and the support vectors.
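Using the same hypothetical quantities from the sketches above, a new point z is then classified by the sign of this decision value:

# Classifying a new point z from the support-vector expansion (sketch).
import numpy as np

z = np.array([1.0, 0.5])
decision = (lam * t) @ (X @ z) + b           # sum_i lambda_i t_i x_i^T z + b
label = 1 if decision >= 0 else -1
print("decision value:", decision, "predicted class:", label)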



Slack Variables for Non-Linearly Separable Problems
In a linearly separable dataset, there exists a hyperplane that perfectly separates the classes, and
every point satisfies the constraint:
ti (wT xi + b) ≥ 1
where ti is the class label (±1), w is the weight vector, and b is the bias.
However, in real-world applications, data is rarely linearly separable. To allow for some degree of
misclassification, we introduce slack variables ηi ≥ 0.
With slack variables, the constraints become:

ti (wT xi + b) ≥ 1 − ηi

where:
- ηi = 0 for correctly classified points on or beyond the margin.
- 0 < ηi ≤ 1 for points inside the margin but on the correct side of the hyperplane.
- ηi > 1 for points misclassified on the wrong side of the hyperplane.

Slack variables allow some points to violate the margin constraint, making the model more
flexible for non-linearly separable data.

Modified Objective Function for SVM with Slack Variables


Incorporating slack variables into the SVM optimization problem requires a trade-off between maximizing the margin and minimizing classification errors. This leads to the objective function:

L(w, η) = wT w + C Σ_{i=1}^{n} ηi

where:
- C is a regularization parameter that controls the trade-off between a wide margin and the penalty for misclassifications.
- wT w corresponds to maximizing the margin.
- C Σ_{i=1}^{n} ηi penalizes points violating the margin; ηi is the distance of a misclassified point from the correct boundary.

- If C is large: the optimization places higher priority on correctly classifying points, potentially reducing the margin. The penalty for misclassification is high.

- If C is small: the optimization favors a larger margin, allowing more misclassifications if necessary.
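The sketch below illustrates this trade-off by fitting soft-margin SVMs with different C values on noisy data (scikit-learn assumed; the dataset is illustrative) and comparing the number of support vectors:

# Illustrating the effect of C on a noisy two-class problem (scikit-learn assumed).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # More support vectors usually indicates a wider margin with more violations allowed.
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}")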

KKT Conditions for SVM with Slack Variables


The Karush-Kuhn-Tucker (KKT) conditions for this modified problem define the optimal solution
for the soft-margin SVM. These conditions are adjusted to account for the slack variables ηi :



- Complementary slackness condition:

  λi (1 − ti (wT xi + b) − ηi) = 0

  - If λi > 0, then the point lies exactly on the margin or within the margin boundary.
  - If ηi > 0, then the point violates the margin constraint (it lies inside or beyond the margin).

- Condition for support vectors:

  (C − λi) ηi = 0

  - If 0 < λi < C, then ηi = 0, indicating the point is a support vector lying exactly on the margin boundary.
  - If λi = C and ηi > 1, the classifier misclassifies the point.
- Separation constraint:

  Σ_{i=1}^{n} λi ti = 0

  This constraint maintains the balance between the support vectors of both classes.
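In practice these conditions can be observed on a fitted model; the sketch below (scikit-learn assumed) inspects the dual coefficients, which store the products ti λi for the support vectors and are bounded in magnitude by C for a soft-margin fit.

# Inspecting support vectors and dual coefficients after a soft-margin fit (sketch).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

alphas = np.abs(clf.dual_coef_).ravel()      # |t_i * lambda_i| for the stored support vectors
print("number of support vectors:", len(clf.support_))
print("all multipliers satisfy 0 < lambda_i <= C:",
      bool(np.all((alphas > 0) & (alphas <= clf.C + 1e-12))))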

Kernel Trick in Machine Learning


The kernel trick is a method used in machine learning to enable algorithms, like Support Vector Machines (SVMs), to operate in a high-dimensional space without directly computing the coordinates in that space. This is useful for solving problems that are not linearly separable in their original feature space. When we cannot linearly separate data in the original feature space, modifying the features can help. This idea is similar to the XOR problem we encountered earlier. By
transforming the data into a higher-dimensional space, we might find a linear decision boundary
that separates the classes. To achieve this, we introduce new functions ϕ(x) based on the input
features.
The key idea is to transform the input xi into a new form ϕ(xi), while still being able to use the SVM algorithm. Specifically, the prediction equation derived earlier remains valid, but with xi replaced by ϕ(xi). For a new point z, the resulting prediction equation becomes:

wT ϕ(z) + b = Σ_{i=1}^{n} λi ti ϕ(xi)T ϕ(z) + b.

The choice of functions ϕ(x) is critical. For instance, if we use a basis consisting of polynomials up to degree 2, we can derive new features from the original input. A simple example for d = 3 dimensions would be:

Φ(x) = (1, √2 x1, √2 x2, √2 x3, x1², x2², x3², √2 x1x2, √2 x1x3, √2 x2x3).

This transformation increases the dimensionality, making it computationally expensive. However, there is a trick: we do not need to compute Φ(xi)T Φ(xj) directly. Instead, we use the kernel trick, which allows us to compute the dot product in the original space. For example, we can express Φ(x)T Φ(y) as:

Φ(x)T Φ(y) = (1 + xT y)².

This reduces the computational cost from O(d²) to O(d), and the same applies to higher-order polynomials.
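This identity is easy to check numerically; the sketch below compares the explicit d = 3 expansion given above with the kernel value (1 + xT y)² (numpy assumed, test vectors illustrative).

# Numerical check that Phi(x)^T Phi(y) equals (1 + x^T y)^2 for d = 3 (sketch).
import numpy as np

def phi(v):
    # Explicit degree-2 polynomial feature map for a 3-dimensional input.
    x1, x2, x3 = v
    s = np.sqrt(2.0)
    return np.array([1, s*x1, s*x2, s*x3, x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

lhs = phi(x) @ phi(y)            # explicit mapping: O(d^2) features
rhs = (1.0 + x @ y) ** 2         # kernel evaluation: O(d) work
print(lhs, rhs)                  # both print 30.25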



Why the Kernel Trick is Needed
In many machine learning tasks, such as classification, we want to find a hyperplane that best
separates the data points of different classes. However, in some cases, the data cannot be separated
with a straight line in the original feature space. The kernel trick helps by implicitly mapping
the data into a higher-dimensional space where it becomes easier to separate.

How the Kernel Trick Works


The kernel trick leverages the concept of a kernel function. Instead of computing the coordinates of the points in the high-dimensional space (which would be computationally expensive), the kernel function directly computes the inner product (dot product) of two points in that space.

Kernel Transformation We map the input space x to a higher-dimensional space ϕ(x) where
linear separation can be achieved:
ϕ : x → ϕ(x)
Instead of explicitly computing this mapping, SVMs use a kernel function K(xi , xj ) = ϕ(xi )T ϕ(xj )
to calculate the dot product directly in the transformed space.
Mathematically, a kernel function K(x, y) is defined as:

K(x, y) = ϕ(x) · ϕ(y)

where:
- x and y are data points in the original feature space.
- ϕ(x) is a mapping function that transforms x to a higher-dimensional space.

With the kernel trick, we don’t need to know or calculate ϕ(x) explicitly. Instead, we only
compute K(x, y), which gives us the inner product of the transformed vectors. This makes it
computationally feasible to work in higher dimensions without explicitly mapping the data.

Common Kernel Functions


Some commonly used kernel functions are:

1. Linear kernel: K(x, y) = x · y
   - Used when the data is already linearly separable.

2. Polynomial kernel: K(x, y) = (x · y + c)^d
   - c: a constant that trades off the influence of higher-order versus lower-order terms. Typically set to 1 or 0.
   - d: the degree of the polynomial. A higher d allows more complex decision boundaries.
   - Useful for creating polynomial decision boundaries, which can separate more complex data.

3. Gaussian (RBF) kernel: K(x, y) = exp(−∥x − y∥² / (2σ²))
   - ∥x − y∥: the Euclidean distance between x and y.
   - σ: the bandwidth parameter, controlling the spread of the Gaussian. Smaller σ values make the kernel more sensitive to changes in x and y.
   - Used for non-linear data, as it can map points to an infinitely high-dimensional space.

4. Sigmoid kernel: K(x, y) = tanh(α x · y + c)
   - α: a scaling parameter, controlling the slope of the sigmoid function.
   - c: a constant that shifts the kernel function along the x-axis, controlling the threshold of the sigmoid.
   - Similar to the activation function in neural networks, often used in neural network-based methods.
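These four kernels can be written directly as functions of the original input vectors, as in the sketch below (the parameter values are illustrative assumptions):

# The four kernels above, evaluated directly in the original input space (sketch).
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, c=1.0, d=2):
    return (x @ y + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, alpha=0.1, c=0.0):
    return np.tanh(alpha * (x @ y) + c)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))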

Example
Imagine a dataset where the data points for two classes form concentric circles. This data is not
linearly separable in two-dimensional space, so a linear SVM would fail to classify it correctly.
However, with the kernel trick (using, for example, a Gaussian kernel), we can implicitly map
this data to a higher-dimensional space where it becomes linearly separable, allowing the SVM to
create a boundary that divides the two classes accurately.
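A sketch of this concentric-circles case, assuming scikit-learn's make_circles helper, shows the gap between a linear and an RBF kernel:

# Concentric circles: a linear kernel fails while an RBF kernel separates them (sketch).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(f"{kernel} kernel accuracy: {clf.score(X, y):.2f}")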

Significance of the Kernel Trick


The kernel trick is significant because:

- It allows complex decision boundaries in the original feature space by transforming data into a higher-dimensional space where linear separation is possible.

- It reduces computational complexity by avoiding the direct computation of high-dimensional coordinates, making algorithms efficient even with complex mappings.

- It enables algorithms to work with non-linear data and still achieve high accuracy, as in SVMs and other machine learning methods.

In essence, the kernel trick enables algorithms to solve complex problems without a massive
increase in computational cost, making it a cornerstone technique in machine learning for non-
linear classification and regression tasks.

Support Vector Machine (SVM) Regression


Interestingly, the Support Vector Machine (SVM) can also be used for regression tasks. The
main idea is to adapt the usual least-squares error function, which typically aims to minimize the
difference between the predicted values and the actual target values. This error function can be
represented as:
(1/2N) Σ_{i=1}^{N} (ti − yi)² + (λ/2) ∥w∥²,
where N is the number of data points, ti is the target value, yi is the predicted value, and w
represents the weights of the model.
To transform this into a form suitable for SVM regression, we utilize what is called an ϵ-insensitive error function, denoted as Eϵ. This function behaves as follows:
- It gives a value of 0 if the absolute difference between the target and the predicted output is less than ϵ.
- If the difference exceeds ϵ, it returns that difference minus ϵ, for consistency.



This transformation helps focus on the points that are not well predicted, which means we can
maintain a small number of support vectors.
The modified error function can be expressed as:

Σ_{i=1}^{N} Eϵ(ti − yi) + (λ/2) ∥w∥²,

where Eϵ(ti − yi) encapsulates the ϵ-insensitive behavior.

Understanding the Prediction Process


Instead of requiring the predictions to be exactly correct, we want them to fall within a tube
of radius ϵ around the ideal prediction line. To accommodate possible errors in prediction, we
introduce slack variables for each data point, denoted as ϵi for the ith data point.
The process involves using Lagrange multipliers, which help manage constraints, transforming our problem into a dual form, applying a kernel function, and ultimately solving it using a quadratic solver.
The final prediction for a test point z is given by the equation:

f(z) = Σ_{i=1}^{n} (µi − λi) K(xi, z) + b,

where µi and λi are the two sets of Lagrange multipliers arising from the constraints on either side of the ϵ-tube, and K(xi, z) represents the kernel function applied to the data points.
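A compact sketch of ϵ-insensitive regression, assuming scikit-learn's SVR implementation, shows the roles of ϵ, C, and the kernel (the data is an illustrative noisy sine curve):

# epsilon-insensitive SVM regression on a noisy sine curve (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
t = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

# Points predicted within the epsilon tube contribute zero error and are not support vectors.
reg = SVR(kernel="rbf", epsilon=0.1, C=10.0).fit(X, t)
print("support vectors used:", len(reg.support_))
print("prediction at x=2.5:", reg.predict([[2.5]])[0])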

Further Developments in SVM


There has been substantial advancement in kernel methods and SVMs, including optimization
techniques such as Sequential Minimal Optimization. Additionally, some methods aim to compute
posterior probabilities instead of making hard classification decisions, one such method being the
Relevance Vector Machine.
Various SVM implementations are available online that offer advanced features beyond those
discussed in the standard texts. While many of these implementations are written in C, some
provide interfaces for other programming languages, including Python. A simple internet search
can yield several options to explore, with popular choices including SVMLight, LIBSVM, and
scikit-learn.

