Applications of Support Vector Machines
SVMs have a wide range of applications across various fields due to their robustness and accuracy.
Some notable applications include:
- Image Classification: SVMs are widely used for classifying images in computer vision tasks,
such as face detection and object recognition.
- Text Classification: SVMs are effective in categorizing text documents, such as spam detection
in emails or sentiment analysis.
- Bioinformatics: SVMs are applied in gene expression classification and protein structure
prediction, helping in medical diagnosis and treatment planning.
- Finance: In finance, SVMs are used for credit scoring, risk assessment, and fraud detection
by analyzing historical transaction data.
- Handwriting Recognition: SVMs have been employed in recognizing handwritten characters
and digits, enhancing the accuracy of Optical Character Recognition (OCR) systems.
- Speech Recognition: SVMs are also used in speech recognition systems to classify spoken
words or phrases.
Support Vector Machines are a versatile and powerful tool in the machine learning toolkit,
known for their ability to handle both linear and non-linear classification problems effectively.
Their theoretical foundation and practical applications make them a key algorithm for researchers
and practitioners in various domains.
Optimal Separation
In the context of Support Vector Machines (SVMs), optimal separation refers to finding the hyperplane
that divides two classes of data in a way that maximizes the margin between them. The margin is
defined as the distance between the hyperplane and the nearest data points from each class, which
are known as support vectors.
The goal of optimal separation is to create a decision boundary that not only separates the
two classes but does so in a manner that minimizes the chance of misclassification when new data
points are introduced. The optimization problem can be expressed mathematically as:
\[
\min_{w, b} \ \frac{1}{2}\|w\|^2
\qquad \text{subject to} \qquad
t_i (w \cdot x_i + b) \ge 1, \quad \forall i
\]
where w is the weight vector, b is the bias term, ti is the class label for data point i, and xi is the
feature vector of data point i.
In this formulation:
- w is the weight vector that defines the orientation of the hyperplane.
- b is the bias term that shifts the hyperplane away from the origin.
- ti is the class label for the training data points xi .
The constraint ensures that points with t_i = +1 satisfy w · x_i + b ≥ 1 and points with t_i = −1
satisfy w · x_i + b ≤ −1, so every training point lies on the correct side of the hyperplane with at
least unit functional margin.
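The same maximum-margin hyperplane can be found numerically. Below is a minimal sketch, assuming scikit-learn is available and using a small synthetic two-cluster dataset (the data, the large value of C, and all variable names are illustrative assumptions, not part of the original text):

import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters labelled t = -1 and t = +1 (toy data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
t = np.array([-1] * 20 + [1] * 20)

# A very large C approximates the hard-margin problem min (1/2)||w||^2.
clf = SVC(kernel="linear", C=1e6).fit(X, t)

w, b = clf.coef_[0], clf.intercept_[0]           # hyperplane parameters
print("w =", w, ", b =", b)
print("margin width =", 2 / np.linalg.norm(w))   # distance between the two margins
print("support vectors:\n", clf.support_vectors_)

The support vectors printed at the end are exactly the points whose constraints hold with equality, as discussed below.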
Lagrange Multipliers
Lagrange multipliers are a method used in optimization to find the local maxima and minima of
a function subject to equality constraints. This technique is particularly useful when we need to
optimize a function while ensuring that certain conditions (constraints) are met.
Constraints: These are the conditions that must be satisfied, denoted as g(x, y) = 0.
Lagrangian Function: The Lagrangian incorporates both the objective function and the
constraint. It is defined as:
\[
\mathcal{L}(x, y, \lambda) = f(x, y) + \lambda\, g(x, y),
\]
where λ is the Lagrange multiplier.
The method then proceeds as follows:
1. Form the Lagrangian from the objective function and the constraint.
2. Take partial derivatives of the Lagrangian with respect to each variable (including the multiplier).
3. Set the derivatives equal to zero and solve the resulting system of equations.
Simple Example
Let’s illustrate this with a basic example.
Example Problem
Maximize the function f (x, y) = xy subject to the constraint g(x, y) = x + y − 10 = 0.
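A short worked solution, following the steps above with the Lagrangian form defined earlier:
\[
\mathcal{L}(x, y, \lambda) = xy + \lambda (x + y - 10),
\]
\[
\frac{\partial \mathcal{L}}{\partial x} = y + \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = x + \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = x + y - 10 = 0.
\]
The first two equations give x = y, and substituting into the constraint gives x = y = 5, so the
constrained maximum is f(5, 5) = 25.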
Returning to the SVM problem, we now apply the same method to the constrained optimization stated earlier.
Objective Function:
\[
\min_{w, b} \ \frac{1}{2}\|w\|^2
\]
where ||w||^2 = w^T w (the squared norm of the weight vector).
Constraints: For all data points (x_i, t_i), where t_i is the label (+1 or −1):
\[
t_i (w^T x_i + b) \ge 1
\]
These constraints ensure that each point is correctly classified and lies at least one unit of
functional margin away from the decision boundary.
At the optimal solution, the following conditions must hold for every training point:
\[
\lambda_i \left(1 - t_i (w^T x_i + b)\right) = 0, \qquad
1 - t_i (w^T x_i + b) \le 0, \qquad
\lambda_i \ge 0.
\]
Here, λ_i are Lagrange multipliers, which allow us to solve constrained optimization problems.
The first condition implies that if λ_i > 0, then t_i (w^{*T} x_i + b^*) = 1, meaning that the constraint
holds as an equality; such points are the support vectors. The support vectors lie on the boundary of the
margin, and because points with λ_i = 0 drop out of the solution, only the support vectors need to be
considered, reducing the number of data points involved.
Lagrangian Function
We define the Lagrangian for the problem as:
\[
L(w, b, \lambda) = \frac{1}{2} w^T w + \sum_{i=1}^{n} \lambda_i \left(1 - t_i (w^T x_i + b)\right).
\]
Taking partial derivatives of the Lagrangian with respect to w and b gives:
\[
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} \lambda_i t_i x_i, \qquad
\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \lambda_i t_i .
\]
Setting these derivatives equal to zero gives us the optimal values for w and b:
\[
w^{*} = \sum_{i=1}^{n} \lambda_i t_i x_i, \qquad
\sum_{i=1}^{n} \lambda_i t_i = 0.
\]
Substituting these values into the Lagrangian function yields the dual problem, where we aim
to maximize the following with respect to λi :
\[
L(w^{*}, b^{*}, \lambda) = \sum_{i=1}^{n} \lambda_i
- \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j t_i t_j x_i^T x_j ,
\]
subject to λ_i ≥ 0 and Σ_{i=1}^{n} λ_i t_i = 0.
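For completeness, the substitution works out as follows (a short derivation using w* = Σ_i λ_i t_i x_i):
\[
\frac{1}{2} w^{*T} w^{*} = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j t_i t_j x_i^T x_j ,
\]
\[
\sum_{i=1}^{n} \lambda_i \left(1 - t_i (w^{*T} x_i + b)\right)
= \sum_{i=1}^{n} \lambda_i
- \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j t_i t_j x_i^T x_j
- b \sum_{i=1}^{n} \lambda_i t_i ,
\]
and the final term vanishes because Σ_i λ_i t_i = 0, which leaves exactly the dual objective above.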
To recover the bias b, note that any support vector x_j satisfies the margin constraint with equality:
\[
t_j (w^T x_j + b) = 1
\]
where:
- t_j is the label of the support vector (either +1 or −1).
- w^T x_j + b represents the linear boundary applied to the support vector x_j.
Expressing w Using Lagrange Multipliers: In terms of the support vectors, we can write w as:
\[
w = \sum_{i=1}^{n} \lambda_i t_i x_i
\]
where:
- λ_i are the Lagrange multipliers.
- x_i are the training data points.
- Only points with λ_i > 0 are support vectors and contribute to w.
Substitute w in the Support Vector Condition: Substituting this expression for w into the support
vector condition gives:
\[
t_j \left( \sum_{i=1}^{n} \lambda_i t_i x_i^T x_j + b \right) = 1
\]
Rearranging, we obtain:
\[
b = t_j - \sum_{i=1}^{n} \lambda_i t_i x_i^T x_j
\]
Averaging Over All Support Vectors: Since any support vector x_j should satisfy this
equation, we compute b for each support vector and then average to obtain a stable value. With
N_s support vectors, this becomes:
\[
b = \frac{1}{N_s} \sum_{j \in \text{support vectors}} \left( t_j - \sum_{i=1}^{n} \lambda_i t_i x_i^T x_j \right)
\]
This averaging approach gives a more robust calculation of b across all support vectors, rather
than relying on a single point.
Thus, classification of a new point z involves computing only inner products between the point and
the support vectors:
\[
y(z) = \operatorname{sign}\left( \sum_{i=1}^{n} \lambda_i t_i \, x_i^T z + b \right),
\]
where the sum effectively runs over the support vectors, since all other points have λ_i = 0.
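As a concrete check, here is a minimal sketch (assuming scikit-learn and reusing the X, t, and clf variables from the earlier linear example; all names are illustrative) that recovers b by averaging over the support vectors and classifies a new point using only inner products with them:

import numpy as np

sv = clf.support_vectors_            # support vectors x_j
lam_t = clf.dual_coef_[0]            # lambda_i * t_i for each support vector
t_sv = t[clf.support_]               # labels t_j of the support vectors

# b = (1/Ns) * sum_j ( t_j - sum_i lambda_i t_i x_i^T x_j )
K = sv @ sv.T                        # inner products between support vectors
b_avg = np.mean(t_sv - lam_t @ K)
print("averaged b:", b_avg, "  sklearn intercept:", clf.intercept_[0])  # should closely match

# Classify a new (hypothetical) point z from inner products with the support vectors.
z = np.array([1.5, 1.0])
decision = lam_t @ (sv @ z) + b_avg
print("predicted label:", np.sign(decision))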
For data that are not perfectly linearly separable, we introduce slack variables η_i ≥ 0 and relax
the constraints to:
\[
t_i (w^T x_i + b) \ge 1 - \eta_i
\]
where:
- η_i = 0 for points on or outside the margin on the correct side;
- 0 < η_i ≤ 1 for points inside the margin but on the correct side of the hyperplane;
- η_i > 1 for misclassified points.
Slack variables allow some points to violate the margin constraint, making the model more
flexible for non-linearly separable data.
The objective function then becomes:
\[
\min_{w, b, \eta} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \eta_i
\]
where:
- C is a regularization parameter that controls the trade-off between a wide margin and the
penalty for margin violations.
- The term w^T w corresponds to maximizing the margin.
- The penalty term C Σ_i η_i penalizes points violating the margin; η_i measures how far a
violating point lies from the correct side of the margin.
If C is large, the optimization places higher priority on correctly classifying points, potentially
reducing the margin, because the penalty for misclassification is high. If C is small, a wider margin
is preferred at the cost of allowing more margin violations.
For the soft-margin problem, the corresponding optimality condition is:
\[
\lambda_i \left(1 - t_i (w^T x_i + b) - \eta_i\right) = 0
\]
- If λ_i > 0, then the point lies exactly on the margin or within the margin boundary.
- If η_i > 0, then the point violates the margin constraint (it lies inside the margin or beyond it).
Condition for support vectors:
\[
(C - \lambda_i)\,\eta_i = 0
\]
- If λ_i < C, then η_i = 0; if additionally λ_i > 0, the point is a support vector lying exactly on
the margin boundary.
- If λ_i = C and η_i > 1, the classifier misclassifies the point.
Separation constraint:
\[
\sum_{i=1}^{n} \lambda_i t_i = 0
\]
This condition, as in the hard-margin case, reflects the balance between the support vectors of
the two classes.
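To illustrate the role of C described above, here is a small sketch, assuming scikit-learn and a synthetic overlapping two-class dataset (the data and the chosen C values are illustrative assumptions):

import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clusters, so some margin violations are unavoidable.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
t = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: {clf.support_.size} support vectors, margin width {margin:.3f}")

With this kind of data, a larger C typically yields fewer support vectors and a narrower margin, while a smaller C tolerates more margin violations in exchange for a wider margin.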
The choice of functions ϕ(x) is critical. For instance, if we use a basis consisting of polynomials
up to degree 2, we can derive new features from the original input. A simple example for d = 3
dimensions would be:
\[
\Phi(x) = \left(1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2,
\sqrt{2}x_1 x_2, \sqrt{2}x_1 x_3, \sqrt{2}x_2 x_3\right).
\]
This transformation increases the dimensionality, making it computationally expensive. However,
there is a trick: we do not need to compute Φ(x_i)^T Φ(x_j) directly. Instead, we use the kernel
trick, which allows us to compute the dot product in the original space. For example, we can
express Φ(x)^T Φ(y) as:
\[
\Phi(x)^T \Phi(y) = (1 + x^T y)^2 ,
\]
which requires only a dot product in the original three-dimensional space.
Kernel Transformation
We map the input space x to a higher-dimensional space ϕ(x) where linear separation can be achieved:
\[
\phi : x \mapsto \phi(x)
\]
Instead of explicitly computing this mapping, SVMs use a kernel function K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j)
to calculate the dot product directly in the transformed space.
Mathematically, a kernel function K(x, y) is defined as:
\[
K(x, y) = \phi(x)^T \phi(y)
\]
where:
- x and y are data points in the original feature space.
- ϕ is the (possibly implicit) mapping into the higher-dimensional feature space.
With the kernel trick, we don’t need to know or calculate ϕ(x) explicitly. Instead, we only
compute K(x, y), which gives us the inner product of the transformed vectors. This makes it
computationally feasible to work in higher dimensions without explicitly mapping the data.
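As a quick numerical sanity check (plain NumPy; the specific test vectors are arbitrary assumptions), the explicit degree-2 feature map Φ given earlier and the kernel (1 + x^T y)^2 produce the same inner product:

import numpy as np

def phi(x):
    # Explicit degree-2 expansion for d = 3, as listed above.
    x1, x2, x3 = x
    s = np.sqrt(2)
    return np.array([1, s*x1, s*x2, s*x3, x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

lhs = phi(x) @ phi(y)        # inner product in the 10-dimensional space
rhs = (1 + x @ y) ** 2       # kernel evaluated in the original 3-dimensional space
print(lhs, rhs)              # both should agree (up to floating-point rounding)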
Commonly used kernel functions include:
1. Linear kernel: K(x, y) = x^T y.
2. Polynomial kernel: K(x, y) = (x^T y + c)^d, where:
- c is a constant that trades off the influence of higher-order versus lower-order terms,
typically set to 1 or 0.
- d is the degree of the polynomial; a higher d allows more complex decision boundaries.
This kernel is useful for creating polynomial decision boundaries, which can separate more
complex data.
3. Gaussian (RBF) kernel: K(x, y) = exp(−‖x − y‖² / (2σ²)).
Example
Imagine a dataset where the data points for two classes form concentric circles. This data is not
linearly separable in two-dimensional space, so a linear SVM would fail to classify it correctly.
However, with the kernel trick (using, for example, a Gaussian kernel), we can implicitly map
this data to a higher-dimensional space where it becomes linearly separable, allowing the SVM to
create a boundary that divides the two classes accurately.
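The following minimal sketch (assuming scikit-learn; the dataset parameters are illustrative) reproduces this situation with the make_circles toy dataset:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: not linearly separable in 2-D.
X, t = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, t)
rbf = SVC(kernel="rbf", gamma="scale").fit(X, t)   # RBF kernel: exp(-gamma * ||x - y||^2)

print("linear kernel accuracy:", linear.score(X, t))   # roughly chance level
print("RBF kernel accuracy:   ", rbf.score(X, t))      # close to 1.0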
The kernel trick thus enables algorithms such as SVMs (and other machine learning methods) to work
with non-linear data and still achieve high accuracy. In essence, it allows them to solve complex
problems without a massive increase in computational cost, making it a cornerstone technique in
machine learning for non-linear classification and regression tasks.
Here, µ_i and λ_i are the sets of constraint (Lagrange multiplier) variables, and K(x_i, z) represents
the kernel function applied to a training point x_i and a point z.