We would like to sincerely thank the authors of the paper “Support Vector
Machines” for their thorough and academic explanation of the basic ideas
behind SVMs. Additionally, we would like to sincerely thank Dr. Aruna
Bommareddi, our supervisor.
The research described in this report draws heavily on that document's thorough treatment of fundamental topics such as margin theory, optimization via Lagrange duality, the development of kernel methods, and the Sequential Minimal Optimization (SMO) algorithm.
Preface
This report’s sections are all intended to provide the reader with a thor-
ough understanding of Support Vector Machines by offering both theoretical
insights and real-world applications. The goal of this work is to be a use-
ful tool for practitioners, researchers, and students who want to learn more
about machine learning.
Table of Contents
1. Abstract
2. Introduction
3. Understanding Margin
4. Functional and Geometric Margin
5. The Most Effective Margin Classifier
6. Adding a Scaling Limitation
7. Lagrange's Dualistic Theory
8. Kernels
9. Efficiency of Kernel Computation
10. Polynomial Kernels
11. Kernel Similarity
12. Applications of Kernel Methods
13. Regularization and the Non-Separable Case
14. Coordinate Ascent
15. The SMO Methodology
16. Conclusion
Abstract
Support Vector Machines (SVMs), a powerful class of supervised learning
algorithms that are commonly used for tasks involving regression and classifi-
cation, are thoroughly examined in this paper. The basic concept of margins
and their impact on classification confidence is introduced at the beginning
of the study.
The paper then examines the mechanics of constructing an optimal margin classifier using Lagrange duality, which is crucial for formulating and solving the SVM optimization problem. To make SVMs suitable for complex, non-linearly separable data, the paper also examines the role of kernel functions in enabling SVMs to operate in high-dimensional or even infinite-dimensional feature spaces.
Additionally, it is mentioned that SVMs can be trained very effectively
using the Sequential Minimal Optimization (SMO) technique, which reduces
computational complexity and improves scalability.
Through theoretical explanations and instructive examples, this paper
aims to provide a clear, structured understanding of SVM methodology, its
mathematical foundations, and practical implementation methodologies.
Support Vector Machines
1 Introduction
This paper presents the Support Vector Machine (SVM) learning algorithm.
SVMs are widely regarded as one of the most powerful supervised learning
algorithms currently in use. Investigating the concept of margins and the
significance of producing a discernible difference between data points is nec-
essary before comprehending the fundamentals of SVMs. The best margin
classifier will then be discussed, leading to Lagrange duality, a mathematical
tool essential to solving SVM optimization problems. Since kernels enable
SVMs to operate efficiently in high-dimensional or even infinite-dimensional
feature spaces, their function will also be examined. Finally, the discussion
will conclude with the Sequential Minimal Optimization (SMO) approach,
which provides a computationally efficient SVM implementation.
2 Understanding Margin
The discussion of SVMs begins with the concept of margins, which provides insight into how confident a classifier is in its predictions. Before margins are formally defined in a later section, this section aims to build an intuitive understanding of them.
Consider logistic regression, which models the probability of the class y = 1 given input x and parameter vector k as h(x) = g(k^T x). Class 1 is predicted when h(x) ≥ 0.5, which is equivalent to k^T x ≥ 0. For a training example with y = 1, a larger value of k^T x corresponds to a higher estimated probability of belonging to this class, and therefore to a more confident prediction. Conversely, the model is quite certain that y = 0 if k^T x is substantially less than zero.
Given a training dataset, an ideal model would ensure that k^T x is significantly greater than zero for all positive cases (y = 1) and significantly less than zero for all negative cases (y = 0). Achieving this separation would demonstrate a high level of confidence in the classification; functional margins, introduced later, formalize this idea. To further illustrate the concept, consider a scenario in which 'x' marks positive training samples and 'o' marks negative ones. The equation k^T x = 0 describes the decision boundary that separates the two classes. Looking at three specific points, A, B, and C, shows how distance from the decision boundary influences prediction confidence.
Point A lies far from the decision boundary. If the objective is to predict y for A, the model can be quite confident that y = 1. Point C, however, is very close to the boundary; although it lies on the side of the hyperplane where y = 1 is predicted, a slight change to the boundary could easily flip its classification, so the confidence in predicting y = 1 at C is low. Point B sits between these two extremes, which reinforces the idea that points farther from the decision boundary are classified with higher confidence.
By maximizing the distance between the border and data points, a clas-
sification model should ideally identify a decision boundary that not only
accurately partitions the training data but also ensures that predictions are
generated with high confidence. This concept would later be formally defined
by geometric margins.
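To make this intuition concrete, the short Python sketch below (not part of the original report; the parameter vector k and the three points are made-up values) evaluates k^T x and the corresponding logistic confidence for points at different distances from the boundary k^T x = 0.

import numpy as np

def sigmoid(z):
    # Logistic function used by logistic regression to turn scores into probabilities.
    return 1.0 / (1.0 + np.exp(-z))

k = np.array([2.0, -1.0])          # hypothetical parameter vector
points = {
    "A (far from boundary)":  np.array([4.0, -3.0]),
    "B (moderate distance)":  np.array([1.5, -0.5]),
    "C (close to boundary)":  np.array([0.6, 1.0]),
}

for name, x in points.items():
    score = k @ x                  # k^T x, a signed score for the example
    print(f"{name}: k^T x = {score:+.2f}, P(y=1|x) = {sigmoid(score):.3f}")

Points far from the boundary receive probabilities near 1 (or near 0 on the other side), while points close to the boundary receive probabilities near 0.5, mirroring the confidence argument above.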
Understanding Notations
Before delving deeper into Support Vector Machines (SVMs), a consistent notation must be introduced to facilitate the discussion of classification. This paper considers a linear classifier for a binary classification problem in which the data consist of features x with corresponding labels y.

In this framework, the class labels are represented as y ∈ {−1, 1} instead of the standard {0, 1}. The linear classifier is specified using a weight vector p and an intercept term b rather than a single parameter vector k, and is expressed as h_{p,b}(x) = g(p^T x + b), where

g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise.

This notation clearly separates the intercept term b from the weight vector p, in contrast to the earlier convention of adding an extra feature x_0 = 1 to the input vector. In this formulation, p plays the role of the parameters previously written as [k_1, ..., k_n]^T, and b plays the role of what was previously k_0.

Unlike logistic regression, which estimates the probability of y = 1 before making a decision, this classifier outputs a class label in {−1, 1} directly, without an intermediate probability estimate.
Functional Margins

Given a training example (x^(i), y^(i)), the functional margin of (p, b) with respect to that example is defined as

t̂^(i) = y^(i) (p^T x^(i) + b).

For a correctly and confidently classified positive example (y^(i) = 1), the quantity p^T x^(i) + b should be a large positive value. Conversely, for a negative example (y^(i) = −1), a large negative value is preferred. Moreover, if y^(i) (p^T x^(i) + b) > 0, the classification of that example is correct. A larger functional margin therefore indicates greater confidence in the classification.

Using functional margins as a confidence measure, however, has a drawback. Since g depends only on the sign of p^T x + b and not on its magnitude, the functional margin can be made arbitrarily large without changing the classification result. For example, if both p and b are scaled by a factor of two, the functional margin doubles but the classifier stays exactly the same. Without an additional normalization constraint, the functional margin alone is therefore not a valid measure of classification confidence. One possible normalization is to constrain the weight vector so that ∥p∥_2 = 1; this idea will be revisited later.

Given a dataset S = {(x^(i), y^(i)), i = 1, ..., m}, the functional margin of the classifier with respect to the entire dataset is the smallest functional margin over all training examples:

t̂ = min_{i=1,...,m} t̂^(i).

In other words, the classifier's confidence on the dataset is determined by its least confident training example.
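As an illustration, the following Python sketch (a minimal example with made-up data and parameters, not from the report) computes the functional margin t̂^(i) = y^(i)(p^T x^(i) + b) of each training example and the dataset-wide functional margin t̂ = min_i t̂^(i).

import numpy as np

# Toy 2-D dataset: rows of X are the x^(i), entries of y are labels in {-1, +1}.
X = np.array([[2.0, 2.0],
              [3.0, 1.0],
              [-1.0, -2.0],
              [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Hypothetical classifier parameters (weight vector p and intercept b).
p = np.array([1.0, 1.0])
b = 0.0

# Functional margin of each example: t_hat_i = y_i * (p^T x_i + b).
t_hat = y * (X @ p + b)

# Functional margin with respect to the whole dataset: the smallest one.
print("per-example functional margins:", t_hat)
print("dataset functional margin:", t_hat.min())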
Geometric Margins
The weight vector p that, together with b, defines the decision boundary is always orthogonal to the separating hyperplane. The concept of geometric margins is illustrated with a training example at point A with label y^(i) = 1. The shortest distance from this point to the decision boundary is the length of segment AB, denoted t^(i).

To determine t^(i), note that the normalized vector p/∥p∥ has unit length and points in the same direction as p. The coordinates of point B can therefore be written as:

x^(i) − t^(i) · p/∥p∥.   (3)

Because B lies exactly on the decision boundary, it satisfies the hyperplane equation:

p^T ( x^(i) − t^(i) · p/∥p∥ ) + b = 0.   (4)

Solving for t^(i) yields:

t^(i) = (p^T x^(i) + b) / ∥p∥ = (p/∥p∥)^T x^(i) + b/∥p∥.   (5)
This derivation was carried out for a positive training example, but the following definition applies in all cases. For a training example (x^(i), y^(i)), the geometric margin of the classifier (p, b) is defined as:

t^(i) = y^(i) ( (p/∥p∥)^T x^(i) + b/∥p∥ ).   (6)
One noteworthy property is the invariance of the geometric margin under rescaling of p and b. If both p and b are scaled by the same factor, the geometric margin remains unchanged, in contrast to the functional margin. This property is useful later in the optimization, because it allows arbitrary scaling constraints to be imposed on p without altering the classifier's behavior. For instance, constraints such as ∥p∥ = 1 or |p_1| = 5 can be imposed without changing the classification.
For a dataset S = {(x^(i), y^(i)), i = 1, ..., m}, the geometric margin of the classifier with respect to the entire dataset is defined as the smallest geometric margin over the individual training examples:

t = min_{i=1,...,m} t^(i).   (7)

In other words, the classifier's geometric margin on the dataset is determined by the training example closest to the decision boundary.
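The short Python sketch below (a made-up illustration, not from the report) computes geometric margins via equation (6) and checks the invariance property discussed above: scaling p and b by a constant doubles the functional margin but leaves the geometric margin unchanged.

import numpy as np

# Toy data and hypothetical classifier parameters (illustrative values only).
X = np.array([[2.0, 2.0],
              [3.0, 1.0],
              [-1.0, -2.0],
              [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
p = np.array([1.0, 1.0])
b = 0.0

def functional_margin(p, b):
    return (y * (X @ p + b)).min()

def geometric_margin(p, b):
    # t^(i) = y^(i) * ((p/||p||)^T x^(i) + b/||p||); the dataset margin is the minimum.
    return (y * (X @ p + b) / np.linalg.norm(p)).min()

print("functional margin:", functional_margin(p, b))
print("geometric margin: ", geometric_margin(p, b))

# Rescale (p, b) by 2: the functional margin doubles, the geometric margin does not change.
print("functional margin after rescaling:", functional_margin(2 * p, 2 * b))
print("geometric margin after rescaling: ", geometric_margin(2 * p, 2 * b))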
4 The Most Effective Margin Classifier

Given the definitions above, a natural goal is to find a decision boundary that maximizes the geometric margin:

max_{t,p,b}  t   (8)

subject to:

y^(i) (p^T x^(i) + b) ≥ t,  i = 1, ..., m,   (9)

∥p∥ = 1.   (10)

The constraint ∥p∥ = 1 forces the functional margin to equal the geometric margin, so solving this problem would yield a classifier achieving the maximum possible geometric margin for the given dataset.
While the desired classifier would ideally be obtained by solving the afore-
mentioned optimization problem, the constraint ∥p∥ = 1 poses a significant
challenge. Due to the non-convex nature of this constraint, the problem is
not one that can be easily solved using standard optimization techniques. To
move forward, we consider an alternative formulation:
max_{t̂,p,b}  t̂ / ∥p∥   (11)

subject to:

y^(i) (p^T x^(i) + b) ≥ t̂,  i = 1, ..., m.   (12)
With this formulation, we aim to maximize t̂/∥p∥, provided that all func-
tional margins are at least t̂. Because t = t̂/∥p∥ connects the geometric and
functional margins, maximizing this expression yields the desired result. The
problematic ∥p∥ = 1 constraint is also removed by this reformulation.
5 Adding a Scaling Limitation

The new formulation, however, introduces a different difficulty: the function t̂/∥p∥ is still non-convex, so the problem cannot yet be handled by standard convex optimization techniques. To address this, we use the fact that p and b can be scaled arbitrarily without changing the decision boundary, which lets us impose a scaling constraint that transforms the problem into a more manageable form. Specifically, we require the functional margin of (p, b) with respect to the training set to be

t̂ = 1.   (13)

Since the functional margin can be scaled arbitrarily by multiplying p and b by a constant, appropriate rescaling of the parameters can always satisfy this constraint. Substituting it into the previous formulation and noting that maximizing t̂/∥p∥ = 1/∥p∥ is equivalent to minimizing ∥p∥², we obtain the following optimization problem:

min_{p,b}  (1/2) ∥p∥²   (14)

subject to:

y^(i) (p^T x^(i) + b) ≥ 1,  i = 1, ..., m.   (15)

This is a convex quadratic objective with linear constraints, and its solution is the optimal margin classifier.
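As a sanity check on this formulation, the following Python sketch (an illustration, not part of the report) solves problem (14)-(15) directly for a small linearly separable dataset using the cvxpy modeling library; the dataset and the choice of cvxpy are assumptions made only for this example.

import numpy as np
import cvxpy as cp

# Small linearly separable toy dataset (made-up values).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables: weight vector p and intercept b.
p = cp.Variable(2)
b = cp.Variable()

# min (1/2)||p||^2  subject to  y^(i) (p^T x^(i) + b) >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(p))
constraints = [cp.multiply(y, X @ p + b) >= 1]
cp.Problem(objective, constraints).solve()

print("optimal p:", p.value)
print("optimal b:", b.value)
print("geometric margin:", 1.0 / np.linalg.norm(p.value))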
6 Lagrange's Dualistic Theory

Consider an optimization problem of the following form:

min_p  f(p)   (16)

subject to:

h_i(p) = 0,  i = 1, ..., l.   (17)

One common technique for solving such problems is the method of Lagrange multipliers. This method introduces the Lagrangian function, defined as:

L(p, m) = f(p) + Σ_{i=1}^{l} m_i h_i(p).   (18)
The parameters m_i are the Lagrange multipliers. The problem is solved by computing the partial derivatives of L and setting them to zero:

∂L/∂p_i = 0,   ∂L/∂m_i = 0.   (19)

The optimal values of p and m are obtained by solving this system of equations.
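To illustrate the mechanics, the Python sketch below (an illustrative example, not from the report) uses the sympy library to solve a small equality-constrained problem, minimizing f(p) = p1² + p2² subject to h(p) = p1 + p2 − 1 = 0, by setting the partial derivatives of the Lagrangian to zero.

import sympy as sp

# Variables p1, p2 and one Lagrange multiplier m1 for the single constraint.
p1, p2, m1 = sp.symbols('p1 p2 m1')

f = p1**2 + p2**2            # objective f(p)
h = p1 + p2 - 1              # equality constraint h(p) = 0

# Lagrangian L(p, m) = f(p) + m1 * h(p).
L = f + m1 * h

# Stationarity: set every partial derivative of L to zero and solve the system.
equations = [sp.diff(L, v) for v in (p1, p2, m1)]
solution = sp.solve(equations, (p1, p2, m1), dict=True)
print(solution)   # expect p1 = p2 = 1/2 with m1 = -1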
For problems that also include inequality constraints g_i(p) ≤ 0, i = 1, ..., k, the generalized Lagrangian adds a multiplier s_i ≥ 0 for each inequality constraint:

L(p, s, m) = f(p) + Σ_{i=1}^{k} s_i g_i(p) + Σ_{i=1}^{l} m_i h_i(p).   (20)

The primal problem minimizes, over p, the maximum of L over the multipliers (with s_i ≥ 0); its optimal value is denoted p*. The dual problem reflects the primal problem with the order of minimization and maximization reversed: defining the dual objective

k_D(s, m) = min_p L(p, s, m),   (21)

the optimal value of the dual problem is given by:

d* = max_{s,m : s_i ≥ 0} k_D(s, m).   (22)

In general d* ≤ p*, because the maximum of a minimum is always less than or equal to the minimum of a maximum. In some cases, however, we obtain strong duality,

d* = p*,   (23)

which holds when the following conditions are met:

• The objective function f(p) and the inequality constraints g_i(p) are convex, and the equality constraints h_i(p) are affine.

• There is at least one p such that g_i(p) < 0 for all i, i.e., the constraints are strictly feasible.

Under these conditions there exist optimal values p*, s*, and m* such that

p* = d* = L(p*, s*, m*).   (24)
Additionally, these optimal values satisfy the Karush-Kuhn-Tucker (KKT) conditions:

1. Stationarity:

∂L(p*, s*, m*)/∂p_i = 0,  i = 1, ..., n,   (25)

∂L(p*, s*, m*)/∂m_i = 0,  i = 1, ..., l.   (26)

2. Complementary Slackness:

s_i* g_i(p*) = 0,  i = 1, ..., k.   (27)

3. Primal Feasibility:

g_i(p*) ≤ 0,  i = 1, ..., k.   (28)

4. Dual Feasibility:

s_i* ≥ 0,  i = 1, ..., k.   (29)
7 Kernels
In the earlier discussion of linear regression, we encountered a scenario in which the input variable represented a home's living area, and we considered using quantities such as x, x², and x³ to fit a cubic function and enhance the regression model. To distinguish between the two sets of variables, we call the original input of the problem, in this case the living area x, the input attributes, and we call the quantities actually passed to the learning algorithm, obtained by transforming the attributes, the input features.

To describe this transformation, we use ϕ(x) to denote the feature mapping that maps input attributes to input features. In the example above:

ϕ(x) = (x, x², x³).
8 Efficiency of Kernel Computation

Rather than computing ϕ(x) explicitly, it is often possible to compute the inner product K(x, x′) = ϕ(x)^T ϕ(x′) directly from x and x′, which can be far cheaper when the feature mapping is high-dimensional. As a simple illustration with a scalar input, consider the kernel K(x, x′) = (1 + x x′)² = 1 + 2 x x′ + x² x′². We find that it corresponds to the feature mapping:

ϕ(x) = (1, √2 x, x²),

since ϕ(x)^T ϕ(x′) = 1 + 2 x x′ + x² x′² = K(x, x′).
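The Python sketch below (an illustrative check, not from the report) verifies numerically that the kernel K(x, x′) = (1 + x x′)² equals the inner product of the explicit feature maps ϕ(x) = (1, √2 x, x²).

import numpy as np

def phi(x):
    # Explicit feature mapping phi(x) = (1, sqrt(2) x, x^2) for a scalar x.
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

def kernel(x, x_prime):
    # Kernel computed directly from the inputs, without forming phi.
    return (1.0 + x * x_prime) ** 2

x, x_prime = 1.5, -0.7
print("phi(x) . phi(x'):", phi(x) @ phi(x_prime))
print("K(x, x'):        ", kernel(x, x_prime))
# Both lines print the same value, so the kernel implicitly works in the
# 3-dimensional feature space without ever computing phi explicitly.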
9 Polynomial Kernels
One commonly used polynomial kernel function is:

K(x, x′) = (x^T x′ + c)^d,

where c ≥ 0 is a constant and d is the degree of the polynomial.
10 Kernel Similarity
If ϕ(x) and ϕ(x′) are close together in feature space, then K(x, x′) will be large; if they are nearly orthogonal, K(x, x′) will be small. A kernel can therefore be read as a measure of similarity. A popular kernel of this kind is the Gaussian kernel:

K(x, x′) = exp( −∥x − x′∥² / (2σ²) ),

which is close to 1 when x and x′ are near each other and close to 0 when they are far apart.
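As a quick illustration (made-up inputs, not from the report), the sketch below evaluates the Gaussian kernel for a nearby pair and a distant pair of points, showing the similarity interpretation.

import numpy as np

def gaussian_kernel(x, x_prime, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    diff = np.asarray(x) - np.asarray(x_prime)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
near = np.array([1.1, 2.1])
far = np.array([5.0, -3.0])

print("K(x, near):", gaussian_kernel(x, near))   # close to 1: similar points
print("K(x, far): ", gaussian_kernel(x, far))    # close to 0: dissimilar points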
11 Applications of Kernel Methods
Kernels have many applications beyond SVMs. For example:

• SVMs with polynomial or Gaussian kernels perform well in digit recognition.

• In bioinformatics, kernel methods can classify strings by comparing counts of substring occurrences; a toy string kernel of this kind is sketched below.

The kernel trick allows any algorithm that relies only on inner products to operate in high-dimensional feature spaces efficiently, making kernel methods a cornerstone of modern machine learning.
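The following Python sketch (an illustrative toy, not from the report) implements a simple substring-count kernel: each string is represented by counts of its length-k substrings, and the kernel value is the inner product of these count vectors.

from collections import Counter

def substring_counts(s, k=3):
    # Count every contiguous substring of length k in s.
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def string_kernel(s, t, k=3):
    # Inner product of the substring-count representations of s and t.
    cs, ct = substring_counts(s, k), substring_counts(t, k)
    return sum(cs[sub] * ct[sub] for sub in cs)

print(string_kernel("GATTACA", "ATTACCA"))   # shares substrings such as "ATT", "TTA"
print(string_kernel("GATTACA", "CCGGCCG"))   # no common length-3 substrings, so 0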
12 Regularization and the Non-Separable Case

When the data are not linearly separable, or when outliers should not dominate the solution, slack variables ξ_i are introduced and the optimization problem is reformulated as:

min_{p,b,ξ}  (1/2) ∥p∥² + C Σ_{i=1}^{m} ξ_i   (30)

subject to:

y^(i) (p^T x^(i) + b) ≥ 1 − ξ_i,  i = 1, ..., m,   (31)

ξ_i ≥ 0,  i = 1, ..., m.   (32)
With this formulation, some examples are permitted to have a functional margin smaller than 1. If an example has functional margin 1 − ξ_i (with ξ_i > 0), the objective is penalized by C ξ_i. The parameter C therefore controls the trade-off between making ∥p∥² small (which makes the margin large) and ensuring that most examples have a functional margin of at least 1.
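Continuing the earlier cvxpy illustration (again an assumption-laden sketch rather than the report's own code), the soft-margin problem (30)-(32) can be written as follows; the toy data include one mislabeled point so that the slack variables are actually used.

import numpy as np
import cvxpy as cp

# Toy data with one point on the "wrong" side, so the hard-margin problem
# would be infeasible and slack matters.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
m, n = X.shape
C = 1.0                                  # regularization parameter

p = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(m)                      # slack variables xi_i >= 0

objective = cp.Minimize(0.5 * cp.sum_squares(p) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ p + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("p =", p.value, " b =", b.value)
print("slacks =", xi.value)              # nonzero slack on the mislabeled point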
13 Coordinate Ascent
Consider the problem of solving the following unconstrained optimization problem:

max_s  W(s_1, s_2, ..., s_m).   (33)

Coordinate ascent repeats the following loop until convergence. For i = 1, ..., m:

s_i := arg max_{ŝ_i} W(s_1, ..., s_{i−1}, ŝ_i, s_{i+1}, ..., s_m).   (34)
When coordinate ascent is run on such an objective, the path it traces moves from the initial point toward the maximum; in particular, since only one variable is altered at a time, each step is parallel to one of the coordinate axes.
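The Python sketch below (a made-up example, not from the report) runs coordinate ascent on a simple concave quadratic W(s) = b^T s − ½ s^T A s; each inner step maximizes W exactly over one coordinate while the others are held fixed (the closed-form update is an assumption of this illustration).

import numpy as np

# Concave quadratic objective W(s) = b^T s - 0.5 * s^T A s with A positive definite.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def W(s):
    return b @ s - 0.5 * s @ A @ s

s = np.zeros(2)                      # initial point
for sweep in range(20):
    for i in range(len(s)):
        # Maximize W over s_i alone: set dW/ds_i = b_i - (A s)_i = 0.
        s[i] = (b[i] - A[i] @ s + A[i, i] * s[i]) / A[i, i]
    # Each update moves parallel to one coordinate axis.

print("coordinate ascent solution:", s)
print("exact maximizer:           ", np.linalg.solve(A, b))
print("W(s) =", W(s))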
14 The SMO Methodology

The Sequential Minimal Optimization (SMO) algorithm solves the dual problem of the SVM,

max_s  W(s_1, ..., s_m),   (35)

subject to the constraints below:

0 ≤ s_i ≤ C,  i = 1, ..., m,   (36)

Σ_{i=1}^{m} s_i y^(i) = 0.   (37)
Suppose we hold s_2, ..., s_m fixed and try to re-optimize W with respect to s_1 alone. Constraint (37) then forces

s_1 y^(1) = − Σ_{i=2}^{m} s_i y^(i).   (38)

Multiplying both sides by y^(1) (and using (y^(1))² = 1) gives:

s_1 = −y^(1) Σ_{i=2}^{m} s_i y^(i),   (39)

so the remaining s_i values fully determine s_1. As a result, s_1 cannot be changed independently without violating the constraint.
This observation motivates the SMO algorithm, which preserves the constraints by updating at least two variables at a time. The method repeats the following two steps until convergence: first, select a pair s_i and s_j to update next; second, re-optimize W with respect to s_i and s_j while holding all the other variables fixed.

To see how the joint update works, suppose s_1 and s_2 are chosen and s_3, ..., s_m are held fixed. Constraint (37) then requires s_1 y^(1) + s_2 y^(2) to equal some constant ζ. This constraint restricts s_1 and s_2 to lie on a line within the feasible region defined by 0 ≤ s_i ≤ C. To remain inside the [0, C] × [0, C] box, the permitted values of s_2 must also satisfy lower and upper limits L and H.
Rewriting s_1 as a function of s_2 using this constraint gives

s_1 = (ζ − s_2 y^(2)) y^(1),

so the objective becomes W((ζ − s_2 y^(2)) y^(1), s_2, ..., s_m). Since all the other s_i values are constants, this is a quadratic function of s_2; that is, it can be written as

a s_2² + b s_2 + c   (44)

for suitable coefficients a, b, and c. Setting the derivative to zero yields the optimal s_2 when the box limits are ignored. If the resulting value falls outside [L, H], it is adjusted ("clipped") as follows:

s_2^new = H,              if s_2^unclipped > H,
s_2^new = s_2^unclipped,  if L ≤ s_2^unclipped ≤ H,   (45)
s_2^new = L,              if s_2^unclipped < L.

Once s_2^new has been found, the corresponding s_1 is recovered from s_1 = (ζ − s_2^new y^(2)) y^(1).
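A minimal Python sketch of this one-dimensional update and clipping step is shown below (illustrative only; the coefficient values are made up, and a < 0 is assumed so that the quadratic has a maximum).

def optimize_s2(a, b_coef, L, H):
    """Maximize a*s2^2 + b_coef*s2 + c over s2 in [L, H], assuming a < 0.

    The constant c does not affect the location of the maximum.
    """
    s2_unclipped = -b_coef / (2.0 * a)   # derivative 2*a*s2 + b_coef = 0
    # Clip to the feasible interval [L, H] as in equation (45).
    if s2_unclipped > H:
        return H
    if s2_unclipped < L:
        return L
    return s2_unclipped

# Example with made-up coefficients: the maximum of -2*s2^2 + 3*s2 is at s2 = 0.75,
# which already lies inside the box [0, C] with C = 1.
print(optimize_s2(a=-2.0, b_coef=3.0, L=0.0, H=1.0))   # 0.75
print(optimize_s2(a=-2.0, b_coef=9.0, L=0.0, H=1.0))   # unclipped 2.25, clipped to 1.0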
Conclusion
Support Vector Machines (SVMs) are among the best supervised learning
algorithms, particularly helpful for classification in high-dimensional spaces.
Their main goal is to identify the hyperplane that produces the strongest
generalization on new data by maximally separating classes.
SVMs introduce the ideas of functional and geometric margins and use convex optimization, formulated through Lagrange duality, to identify the optimal margin classifier. The kernel trick then lets the classifier operate efficiently in high-dimensional feature spaces by employing kernel functions, without ever computing the feature transformation explicitly.

To handle data that are not linearly separable, SVMs use slack variables together with a regularization parameter that balances margin size against misclassification penalties.
Finally, the dual problem of the SVM can be solved efficiently using the Sequential Minimal Optimization (SMO) algorithm.