
Acknowledgment

We would like to sincerely thank the authors of the paper “Support Vector Machines” for their thorough and academic explanation of the basic ideas behind SVMs. Additionally, we would like to sincerely thank Dr. Aruna Bommareddi, our supervisor.

The research described in this report was critically based on the document's thorough treatment of fundamental subjects, such as margin theory, optimization using Lagrange duality, kernel method development, and the Sequential Minimal Optimization (SMO) algorithm.

Our comprehension and use of Support Vector Machines were substantially improved by the methodical presentation of mathematical formulations coupled with understandable, intuitive explanations. Our work's theoretical and practical components were enhanced by the thoroughness and accuracy with which difficult ideas were handled.

We owe a debt of gratitude to the authors for their invaluable contribution, which was essential to the accomplishment of this study.
Preface

A thorough and organized investigation of Support Vector Machines (SVMs), a crucial technique in supervised machine learning, is provided in this report. The content is arranged according to major themes, each of which contributes to a better comprehension of the SVM framework:

• Overview of Support Vector Machines: The report starts off by outlining the rationale for SVMs and the importance of establishing a distinct margin between classes in order to enhance classification performance.

• Maximum Margin Optimization: The SVM optimization problem formulation is presented, explaining how a complicated non-convex problem is converted into a convex one that can be solved computationally.

• The Dual Problem and Support Vectors: The importance of support vectors is highlighted, elucidating how the decision boundary is determined by only a subset of the training points, thereby increasing model efficiency.

• Sequential Minimal Optimization (SMO): The efficient SMO algorithm, which solves the dual optimization problem on large-scale datasets, is described in detail along with how it updates pairs of variables under the problem's constraints.

This report's sections are all intended to provide the reader with a thorough understanding of Support Vector Machines by offering both theoretical insights and real-world applications. The goal of this work is to be a useful tool for practitioners, researchers, and students who want to learn more about machine learning.

Table of Contents

Abstract
1. Introduction
2. Understanding Margin
3. Functional and Geometric Margin
4. The Most Effective Margin Classifier
5. Adding a Scaling Limitation
6. Lagrange's Dualistic Theory
7. Kernels
8. Efficiency of Kernel Computation
9. Polynomial Kernels
10. Kernel Similarity
11. Applications of Kernel Methods
12. Regularization and Non-Separable Case
13. Coordinate Ascent
14. The SMO Methodology
Conclusion
References

Abstract
Support Vector Machines (SVMs), a powerful class of supervised learning
algorithms that are commonly used for tasks involving regression and classifi-
cation, are thoroughly examined in this paper. The basic concept of margins
and their impact on classification confidence is introduced at the beginning
of the study.
The mechanics of constructing an optimal margin classifier using La-
grange duality—which is crucial for generating and resolving SVM optimiza-
tion problems—are then examined. In order to make SVMs more suitable for
complex, non-linearly separable data points, the study also examines the sig-
nificance of kernel functions in enabling them to operate in high-dimensional
or infinite-dimensional feature spaces.
Additionally, it is mentioned that SVMs can be trained very effectively
using the Sequential Minimal Optimization (SMO) technique, which reduces
computational complexity and improves scalability.
Through theoretical explanations and instructive examples, this paper
aims to provide a clear, structured understanding of SVM methodology, its
mathematical foundations, and practical implementation methodologies.

Support Vector Machines

1 Introduction
This paper presents the Support Vector Machine (SVM) learning algorithm.
SVMs are widely regarded as one of the most powerful supervised learning
algorithms currently in use. Before examining the fundamentals of SVMs, we investigate the concept of margins and the importance of separating data points from the decision boundary by a clear gap. The best margin
classifier will then be discussed, leading to Lagrange duality, a mathematical
tool essential to solving SVM optimization problems. Since kernels enable
SVMs to operate efficiently in high-dimensional or even infinite-dimensional
feature spaces, their function will also be examined. Finally, the discussion
will conclude with the Sequential Minimal Optimization (SMO) approach,
which provides a computationally efficient SVM implementation.

2 Understanding Margin
We begin the discussion of SVMs with the concept of margins, which provide insight into the confidence of a prediction. Prior to their formal definition in a sub-
sequent section, this section aims to establish an intuitive understanding of
margins.
Consider logistic regression, which uses h(x) = g(k^T x) to model the probability of the class y = 1 given input x and parameter vector k. Class "1" is predicted when h(x) ≥ 0.5, which is equivalent to k^T x ≥ 0. For a training example with y = 1, a larger value of k^T x corresponds to a higher likelihood of belonging to this class, and hence a more confident prediction. In contrast, the model is fairly certain that y = 0 if k^T x is substantially less than zero.
Given a training dataset, an ideal model would ensure that k^T x is significantly greater than zero for all positive cases (y = 1) and significantly lower than zero for negative cases (y = 0). Achieving this would demonstrate a high level of classification confidence; functional margins, introduced below, formalize this idea. To further elucidate the concept, consider a scenario in which "x" represents positive training samples and "o" represents negative ones. The equation k^T x = 0 describes the decision boundary that separates the two classes. One can better understand how distance from the decision boundary influences prediction confidence by looking at three specific points, A, B, and C.
The decision boundary is far from Point A. The model can be pretty sure
that y = 1 if the objective is to predict y for A. Point C, however, is fairly close
to the border. Despite being on the side of the hyperplane that predicts y = 1,
a slight alteration to the boundary could easily yield a different classification.
As a result, the forecasting confidence level at C for y = 1 is extremely low.
Since Point B is situated in the middle of these two possibilities, it lends
credence to the idea that data points that are further away from the decision
boundary are associated with higher prediction confidence.
By maximizing the distance between the border and data points, a clas-
sification model should ideally identify a decision boundary that not only
accurately partitions the training data but also ensures that predictions are
generated with high confidence. This concept would later be formally defined
by geometric margins.

Understanding Notations
Before delving deeper into Support Vector Machines (SVMs), a well-designed
notation system must be presented to facilitate the discussion of classifica-
tion. This paper examines a linear classifier designed for a binary classi-
fication problem in which the data consists of corresponding labels y and
characteristics x.
In this framework, the class labels will be represented as y ∈ {−1, 1} instead of the standard {0, 1}. The linear classifier will be specified using a weight vector p and a bias term b rather than a single parameter vector k. The classifier is then expressed as

h_{p,b}(x) = g(p^T x + b), \qquad g(z) = \begin{cases} 1, & \text{if } z \ge 0, \\ -1, & \text{otherwise.} \end{cases}

This notation clearly separates the bias term b from the weight vector p, in contrast to the earlier convention of adding an extra feature x_0 = 1 to the input vector. In this formulation, p plays the role of the remaining parameters, previously written as [k_1, \dots, k_n]^T, and b plays the role of what was previously called k_0.

In contrast to logistic regression, which estimates the probability of y = 1 before making a decision, this classifier outputs a class label in {−1, 1} directly, without computing an intermediate probability estimate.
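As a concrete illustration (added here; not part of the original paper), the following short Python sketch implements this decision rule for an assumed weight vector p and bias b:

import numpy as np

def predict(p, b, X):
    """Linear decision rule: +1 where p^T x + b >= 0, otherwise -1.
    p is an (n,) weight vector, b a scalar bias, X an (m, n) matrix of inputs."""
    scores = X @ p + b
    return np.where(scores >= 0, 1, -1)

# Hypothetical parameters and inputs:
p = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 2.0]])
print(predict(p, b, X))   # [ 1 -1]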

3 Functional and Geometric Margin


Understanding classification confidence can be improved by defining both functional and geometric margins. Given a training example (x^(i), y^(i)), the functional margin of the classifier parameters (p, b) with respect to that example is

t^{(i)} = y^{(i)} (p^T x^{(i)} + b).    (1)

For a correctly and confidently classified positive example (y^(i) = 1), the term p^T x^(i) + b should be a large positive value; for a negative example (y^(i) = -1), a large negative value is preferred. Furthermore, the classification of a particular example is correct whenever y^(i)(p^T x^(i) + b) > 0. A larger functional margin therefore indicates greater confidence in the classification.
Using functional margins as a confidence gauge, however, has its drawbacks. Because the function g depends only on the sign of p^T x + b and not on its magnitude, the functional margin can be made arbitrarily large without changing the classification result. For example, if both p and b are scaled by a factor of two, the functional margin doubles but the classifier stays exactly the same. This shows that, without a normalization restriction, the functional margin alone is not a valid measure of classification confidence. Imposing the constraint ||p||_2 = 1 on the weight vector is one such normalization; this idea will be revisited later.
Given a dataset S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}, the classifier's functional margin with respect to the entire dataset is the smallest functional margin over all training examples:

\hat{t} = \min_{i=1,\dots,m} t^{(i)}.    (2)

In other words, the dataset-level functional margin reflects the least confidently classified training example.
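As a small illustrative sketch (added here, not from the original paper), the functional margins and the dataset-level margin of equation (2) can be computed as follows, assuming labels y^(i) in {−1, 1}:

import numpy as np

def functional_margins(p, b, X, y):
    """Per-example functional margins t^(i) = y^(i) (p^T x^(i) + b)."""
    return y * (X @ p + b)

def dataset_functional_margin(p, b, X, y):
    """Smallest functional margin over the training set, equation (2)."""
    return functional_margins(p, b, X, y).min()

# Hypothetical data: two positive and two negative examples.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
p, b = np.array([1.0, 1.0]), 0.0
print(dataset_functional_margin(p, b, X, y))   # 3.0

Note that doubling p and b doubles this value without changing the classifier, which is exactly the drawback discussed above.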

Geometric Margins
For a decision boundary defined by (p, b), the weight vector p is orthogonal to the separating hyperplane. The concept of geometric margins will be illustrated with the aid of a training example at point A with the label y^(i) = 1. The shortest distance between this point and the decision boundary is the segment AB, whose length we denote t^(i).

One can determine the value of t^(i) by noting that the normalized vector p/||p|| has unit length and points in the same direction as p. Point B's coordinates can therefore be written as

x^{(i)} - t^{(i)} \frac{p}{\|p\|}.    (3)

Because B lies exactly on the decision boundary, it satisfies the equation of the hyperplane:

p^T \left( x^{(i)} - t^{(i)} \frac{p}{\|p\|} \right) + b = 0.    (4)

Solving for t^{(i)} yields

t^{(i)} = \frac{p^T x^{(i)} + b}{\|p\|} = \left( \frac{p}{\|p\|} \right)^T x^{(i)} + \frac{b}{\|p\|}.    (5)
This derivation was specifically for a positive training scenario, even
though the definition applies to all circumstances. Regarding a training
example (x^(i), y^(i)), the classifier's geometric margin is given by

t^{(i)} = y^{(i)} \left( \left( \frac{p}{\|p\|} \right)^T x^{(i)} + \frac{b}{\|p\|} \right).    (6)

One noteworthy feature is the invariance of the geometric margin under rescaling of p and b: if both are scaled by the same factor, the geometric margin remains unchanged, in contrast to the functional margin. This property is useful later in optimization, because it allows an arbitrary scaling constraint to be imposed on p without altering the classification behavior. For instance, restrictions such as ||p|| = 1 or |p_1| = 5 can be applied without changing the classifier.
For a dataset S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}, the classifier's geometric margin with respect to the entire dataset is defined as the smallest geometric margin over the training examples:

t = \min_{i=1,\dots,m} t^{(i)}.    (7)

That is, the dataset-level geometric margin is the distance from the decision boundary to the closest training example.
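A minimal numerical sketch (added for illustration, with made-up data) of the geometric margin in equations (6)-(7), including its invariance to rescaling:

import numpy as np

def geometric_margins(p, b, X, y):
    """Per-example geometric margins: functional margins divided by ||p||, equation (6)."""
    return y * (X @ p + b) / np.linalg.norm(p)

def dataset_geometric_margin(p, b, X, y):
    """Smallest geometric margin over the training set, equation (7)."""
    return geometric_margins(p, b, X, y).min()

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
p, b = np.array([1.0, 1.0]), 0.0
print(dataset_geometric_margin(p, b, X, y))            # about 2.1213
print(dataset_geometric_margin(2 * p, 2 * b, X, y))    # identical: rescaling has no effect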

4 The Most Effective Margin Classifier


Finding a decision boundary that maximizes the geometric margin given a
training dataset is an obvious objective based on our earlier discussion. When
a classifier achieves this, it indicates a strong match to the training data and
provides a high level of confidence in its predictions. For such a classifier,
keeping a significant gap ensures a distinct separation between positive and
negative training instances.
Assume for the purposes of this discussion that the given training set
is linearly separable, or that the two classes may be precisely separated by
at least one hyperplane. Finding the hyperplane with the largest geometric
margin is our goal. This leads to the following optimization formulation:

\max_{p, b} \ t    (8)

subject to:

y^{(i)} (p^T x^{(i)} + b) \ge t, \quad \forall i = 1, \dots, m,    (9)

\|p\| = 1.    (10)

The goal here is to maximize t while ensuring that each training example has a functional margin of at least t. Since the condition \|p\| = 1 makes the functional margin equal to the geometric margin, all geometric margins are then guaranteed to be at least t. Solving this optimization problem thus yields the parameters (p, b) achieving the maximum geometric margin on the given dataset.
While the desired classifier would ideally be obtained by solving the afore-
mentioned optimization problem, the constraint ∥p∥ = 1 poses a significant
challenge. Due to the non-convex nature of this constraint, the problem is
not one that can be easily solved using standard optimization techniques. To
move forward, we consider an alternative formulation:


\max_{p, b} \ \frac{\hat{t}}{\|p\|}    (11)

subject to:

y^{(i)} (p^T x^{(i)} + b) \ge \hat{t}, \quad \forall i = 1, \dots, m.    (12)

With this formulation, we aim to maximize t̂/∥p∥, provided that all func-
tional margins are at least t̂. Because t = t̂/∥p∥ connects the geometric and
functional margins, maximizing this expression yields the desired result. The
problematic ∥p∥ = 1 constraint is also removed by this reformulation.
However, the new formulation introduces a new challenge: the function
t̂/∥p∥ is still non-convex, making it more difficult to solve using conventional
optimization techniques. To assist with this, we make use of the fact that the
values p and b can be scaled indefinitely without altering the classification
border. This knowledge enables us to apply a scale restriction, which is
necessary to transform the issue into a more manageable form.

5 Adding a Scaling Limitation


We include a constraint that sets the functional margin of (p, b) with respect to the training set to exactly 1:

\hat{t} = 1.    (13)

This restriction can always be met, because multiplying p and b by a constant scales the functional margin proportionally, so an appropriate rescaling of the parameters achieves \hat{t} = 1. Substituting this constraint into the previous formulation, and noting that maximizing \hat{t}/\|p\| = 1/\|p\| is equivalent to minimizing \|p\|^2, we obtain the following optimization problem:

\min_{p, b} \ \frac{1}{2} \|p\|^2    (14)

subject to:

y^{(i)} (p^T x^{(i)} + b) \ge 1, \quad \forall i = 1, \dots, m.    (15)

This restated problem is convex and can be solved efficiently. In particular, the objective function is quadratic and the constraints are linear, so the optimal margin classifier can be found using standard quadratic programming (QP) techniques.
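As an illustrative sketch (not part of the original paper), the primal problem (14)-(15) can be handed directly to a generic convex solver; the example below assumes the cvxpy package and a small, linearly separable toy dataset:

import cvxpy as cp
import numpy as np

# Hypothetical, linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

p = cp.Variable(X.shape[1])   # weight vector
b = cp.Variable()             # bias term

# Equations (14)-(15): minimize (1/2)||p||^2 subject to y_i (p^T x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(p))
constraints = [cp.multiply(y, X @ p + b) >= 1]
cp.Problem(objective, constraints).solve()

print(p.value, b.value)   # parameters of the optimal margin classifier

In practice the dual formulation discussed next is preferred, precisely because it admits kernels and the fast SMO algorithm.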
We take a slight detour to explore Lagrange duality, which will result
in the dual version of our optimization problem, even though this method
provides a good solution. The dual formulation is crucial because it enables
the use of kernel approaches, which enable support vector machines (SVMs)
to function efficiently in high-dimensional feature spaces. Additionally, a
fast optimization algorithm that significantly outperforms conventional QP
solvers can be derived using the dual formulation. We will take a closer look
at this alternative approach.

6 Lagrange’s Dualistic Theory


To better understand the mathematical foundation of support vector machines (SVMs) and maximum margin classifiers, we first explore the general concept of constrained optimization and its solution via Lagrange duality.

Consider the following type of optimization problem:

\min_{p} \ f(p)    (16)

subject to the equality constraints:

h_i(p) = 0, \quad \forall i = 1, \dots, l.    (17)

One common technique for resolving such problems is the Lagrange multiplier
method. This method introduces the Lagrangian function, which has the
following definition:

L(p, m) = f(p) + \sum_{i=1}^{l} m_i h_i(p).    (18)

Here the parameters m_i are the Lagrange multipliers. The problem can be solved by computing the partial derivatives of L and setting them to zero:

\frac{\partial L}{\partial p_i} = 0, \qquad \frac{\partial L}{\partial m_i} = 0.    (19)

The optimal values of p and m are obtained by solving this system of equations.
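As a small worked example (added for illustration; not from the original paper), consider minimizing f(p) = p_1^2 + p_2^2 subject to the single equality constraint h_1(p) = p_1 + p_2 - 1 = 0. The Lagrangian is

L(p, m) = p_1^2 + p_2^2 + m_1 (p_1 + p_2 - 1),

and setting its partial derivatives to zero gives

\frac{\partial L}{\partial p_1} = 2 p_1 + m_1 = 0, \qquad \frac{\partial L}{\partial p_2} = 2 p_2 + m_1 = 0, \qquad \frac{\partial L}{\partial m_1} = p_1 + p_2 - 1 = 0,

so p_1 = p_2 = 1/2 and m_1 = -1: the closest point to the origin on the constraint line.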

The Two Formulations


More generally, suppose the problem also has inequality constraints g_i(p) \le 0, i = 1, \dots, k. The generalized Lagrangian is then

L(p, s, m) = f(p) + \sum_{i=1}^{k} s_i g_i(p) + \sum_{i=1}^{l} m_i h_i(p),

where the multipliers s_i attached to the inequality constraints are required to satisfy s_i \ge 0. The primal problem can be written as minimizing k_P(p) = \max_{s, m : s_i \ge 0} L(p, s, m) over p, and its optimal value is denoted p^*.

Now, we define the dual function as

k_D(s, m) = \min_{p} L(p, s, m),    (20)

where the subscript "D" indicates the dual formulation.

Here we first minimize over p, in contrast to the primal function k_P, where we maximized over s and m. The resulting dual optimization problem is:

\max_{s, m : s_i \ge 0} k_D(s, m) = \max_{s, m : s_i \ge 0} \min_{p} L(p, s, m).    (21)

This mirrors the primal problem, except that the order of minimization and maximization is reversed. The optimal value of the dual problem is

d^* = \max_{s, m : s_i \ge 0} k_D(s, m).    (22)

The Connection Between Dual and Primal Problems


The optimal value of the dual problem is always less than or equal to the
optimal value of the primal problem, according to a fundamental optimization
result:

d^* = \max_{s, m : s_i \ge 0} \min_{p} L(p, s, m) \ \le \ \min_{p} \max_{s, m : s_i \ge 0} L(p, s, m) = p^*.    (23)

The fact that, in general, the maximum of a minimum is always less than
or equal to the minimum of a maximum leads to this inequality. However,
in some cases we obtain strong duality, meaning that

d^* = p^*.

Conditions of Strong Duality


Strong duality holds under specific assumptions, such as:

• The objective function f (p) and the inequality constraints gi (p) are
convex.

• The equality constraints hi (p) are linear functions, or affine functions.

• There is at least one p such that gi (p) < 0 for all i, indicating that the
constraints are strictly feasible.

Under these conditions, there exist optimal points p^*, s^*, m^* such that

p^* = d^* = L(p^*, s^*, m^*).    (24)

Additionally, the following Karush-Kuhn-Tucker (KKT) criteria are sat-
isfied by these optimal values:

1. Stationarity:

\frac{\partial}{\partial p_i} L(p^*, s^*, m^*) = 0, \quad i = 1, \dots, n,    (25)

\frac{\partial}{\partial m_i} L(p^*, s^*, m^*) = 0, \quad i = 1, \dots, l.    (26)

2. Complementary Slackness:

s_i^* g_i(p^*) = 0, \quad i = 1, \dots, k.    (27)

3. Primal Feasibility:

g_i(p^*) \le 0, \quad i = 1, \dots, k.    (28)

4. Dual Feasibility:

s_i^* \ge 0, \quad i = 1, \dots, k.    (29)

The complementary slackness condition (27) is particularly important: it implies that if s_i^* > 0, then g_i(p^*) = 0 must hold with equality, i.e., the corresponding constraint is active. Support vector machines rely heavily on this fact, since it shows that only a subset of the training data, the support vectors, shapes the decision boundary. Additionally, the KKT criteria provide a basis for checking convergence in algorithms such as Sequential Minimal Optimization (SMO). A small sketch of how this manifests in practice follows.
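For instance (an illustrative sketch using scikit-learn, which the paper does not discuss), a trained SVM exposes exactly this sparsity: only the support vectors appear in the fitted model:

import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [4.0, 4.0],
              [-1.0, -2.0], [-2.0, -1.0], [-4.0, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # a large C approximates the hard-margin classifier
clf.fit(X, y)

print(clf.support_)           # indices of the support vectors (multipliers > 0)
print(clf.support_vectors_)   # the support vectors themselves
# All other training points have zero multipliers and do not affect the boundary.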

7 Kernels
In our earlier discussion of linear regression, we encountered a scenario in which the input variable represented a home's living area. We considered using derived features such as x^2 and x^3 to fit a cubic function and enhance the regression model. To distinguish between the two sets of variables, we call the original input variable the input attributes of the problem, in this case the living area. When these attributes are mapped to a new set of quantities that are passed to the learning algorithm, we refer to the new quantities as the input features.
To denote this mapping, we use \phi(x) to represent the feature mapping that transforms input attributes into input features. For example, in the situation above:

\phi(x) = (x, x^2, x^3).

Instead of applying Support Vector Machines (SVMs) directly to the original input attributes, we can learn using the mapped features. This entails replacing every occurrence of x in the algorithm with \phi(x). Since the algorithm can be written entirely in terms of inner products \langle x, x' \rangle, it suffices to replace those inner products with K(x, x'). Thus, the kernel function is defined as:

K(x, x') = \langle \phi(x), \phi(x') \rangle.
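As a brief illustrative sketch (added here, not from the paper), the kernel can be evaluated through the explicit feature map, and the result matches the directly expanded inner product:

import numpy as np

def phi(x):
    """Explicit cubic feature map for a scalar attribute x."""
    return np.array([x, x**2, x**3])

def kernel(x, xp):
    """K(x, x') = <phi(x), phi(x')> computed via the explicit mapping."""
    return phi(x) @ phi(xp)

x, xp = 2.0, 3.0
print(kernel(x, xp))                            # 2*3 + 4*9 + 8*27 = 258.0
print(x * xp + x**2 * xp**2 + x**3 * xp**3)     # the same value, written out directly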

8 Efficiency of Kernel Computation


Naively, one could compute K(x, x') by first evaluating \phi(x) and \phi(x') and then taking their inner product. Although this can be computationally expensive when \phi(x) is high-dimensional, in many circumstances K(x, x') itself can be computed rapidly without ever forming \phi(x) explicitly, which is precisely what makes kernels useful.
Consider the kernel function

K(x, x') = (x^T x' + 1)^2,

which expands to

K(x, x') = 1 + 2 x^T x' + (x^T x')^2.

For a one-dimensional input x, this corresponds to the feature mapping

\phi(x) = (1, \sqrt{2}\, x, x^2),

and for d-dimensional inputs the corresponding feature map contains on the order of d^2 monomial coordinates. Direct computation of K(x, x'), however, requires only the inner product x^T x' and therefore O(d) time, offering significant efficiency improvements.
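A tiny numerical check of this equivalence (added for illustration, assuming the one-dimensional case above):

import numpy as np

def phi(x):
    """Feature map matching the kernel (x x' + 1)^2 for scalar x."""
    return np.array([1.0, np.sqrt(2.0) * x, x**2])

x, xp = 3.0, -2.0
direct = (x * xp + 1.0) ** 2       # direct kernel evaluation, no feature map needed
explicit = phi(x) @ phi(xp)        # inner product in the mapped feature space
print(direct, explicit)            # both print 25.0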

9 Polynomial Kernels
One commonly used polynomial kernel function is

K(x, x') = (x^T x' + c)^d,

which corresponds, up to constant coefficients on each coordinate, to a feature mapping whose entries are the monomials of the input attributes, for example

\phi(x) = (x_1^d, \; x_1^{d-1} x_2, \; \dots, \; x_n^d, \; \dots),

where the trailing terms are the lower-degree monomials contributed by the constant c. The kernel therefore gives access to a very high-dimensional feature space while each evaluation of K(x, x') remains efficient.
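A short sketch (added for illustration) of evaluating this kernel on an entire dataset at once, with hypothetical values for c and d:

import numpy as np

def polynomial_gram(X, c=1.0, degree=3):
    """Gram matrix K[i, j] = (x_i^T x_j + c)^degree for the rows of X."""
    return (X @ X.T + c) ** degree

X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
K = polynomial_gram(X, c=1.0, degree=3)
print(K.shape)   # (3, 3); no explicit high-dimensional feature map is ever formed
print(K)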

10 Kernel Similarity
If \phi(x) and \phi(x') are close together in feature space, then K(x, x') will be large; if they are nearly orthogonal, K(x, x') will be small, so the kernel can be read as a measure of similarity between x and x'. A popular kernel of this kind is the Gaussian kernel:

K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right).

This function approaches 1 when x and x' are close, and approaches 0 when they are far apart.
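A minimal sketch of the Gaussian kernel (added here, with an assumed bandwidth sigma):

import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma**2))

a = np.array([1.0, 2.0])
print(gaussian_kernel(a, a))                        # 1.0 for identical points
print(gaussian_kernel(a, np.array([10.0, -5.0])))   # nearly 0 for distant points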

11 Applications of Kernel Methods
Kernels have many applications beyond SVMs. For example:

• SVMs with polynomial or Gaussian kernels perform well in handwritten digit recognition.

• In bioinformatics, kernel methods classify strings, such as DNA or protein sequences, by counting substring occurrences.

The kernel trick allows any algorithm that relies only on inner products to operate in high-dimensional spaces efficiently, making kernel methods a cornerstone of modern machine learning.

12 Regularization and Non-Separable Case


The Support Vector Machine (SVM) has been developed thus far based on
the supposition that the given data is linearly separable. Separability is not
always guaranteed, even though moving data to a higher-dimensional feature
space using the function ϕ typically increases the likelihood of achieving it.
Furthermore, finding a perfectly separating hyperplane is not always the best option, because it can be highly sensitive to outliers: a single outlying point added near the boundary can shift the decision boundary drastically and produce a classifier with a significantly smaller margin.
In order to fit non-linearly separable datasets and lessen sensitivity to
outliers, we modify our optimization problem by adding ℓ1 regularization as
follows:

\min_{p, b, \xi} \ \frac{1}{2} \|p\|^2 + C \sum_{i=1}^{m} \xi_i    (30)

subject to:

y^{(i)} (p^T x^{(i)} + b) \ge 1 - \xi_i, \quad i = 1, \dots, m,    (31)

\xi_i \ge 0, \quad i = 1, \dots, m.    (32)

With this method, some data points are allowed to have a functional margin smaller than 1. If an example has functional margin 1 - \xi_i (with \xi_i > 0), the objective function is penalized by C\xi_i. The parameter C controls the trade-off between keeping \|p\|^2 small (which makes the margin large) and ensuring that most examples have a functional margin of at least 1.
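As an illustrative sketch (not part of the paper), the regularization parameter C of equation (30) is exposed directly by common SVM implementations; scikit-learn is assumed here:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical two-class data that is not perfectly separable.
X = np.vstack([rng.normal(loc=[2, 2], scale=1.5, size=(50, 2)),
               rng.normal(loc=[-2, -2], scale=1.5, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A small C tolerates more margin violations (larger slacks, wider margin);
    # a large C penalizes violations heavily and follows the data more closely.
    print(C, clf.n_support_)   # number of support vectors per class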

13 Coordinate Ascent
Consider the problem of solving the following unconstrained optimization problem:

\max_{s_1, \dots, s_m} W(s_1, s_2, \dots, s_m).    (33)

Here, W is simply some function of the parameters s_i; for the moment we ignore any connection to SVMs. Although we have previously studied optimization strategies such as gradient ascent and Newton's method, the technique we now introduce is called coordinate ascent:

Repeat until convergence:

1. For i = 1, \dots, m:

s_i := \arg\max_{\hat{s}_i} W(s_1, \dots, s_{i-1}, \hat{s}_i, s_{i+1}, \dots, s_m).    (34)

In this process, optimization focuses on one variable at a time while hold-


ing the others constant. The simplest implementation uses a sequential up-
date order (s1 , s2 , . . . , sm ), but more advanced versions may prioritize updates
based on their expected impact on the objective function W (s).
When W is set up so that the maximization in each iteration can be
computed rapidly, coordinate ascent proves to be a very effective technique.
A typical run of coordinate ascent on a quadratic function can be pictured as follows: starting from an initial point, the iterates trace a path toward the global maximum, and because only one variable is altered at a time, each step of the path runs parallel to one of the coordinate axes. An illustrative implementation follows.
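A compact sketch of coordinate ascent (added for illustration) on the concave quadratic W(s) = -s^T A s + b^T s, where each one-dimensional maximization has a closed form; A, b, and the starting point are made-up values:

import numpy as np

def coordinate_ascent(A, b, s0, sweeps=50):
    """Maximize W(s) = -s^T A s + b^T s (A symmetric positive definite)
    by exactly maximizing over one coordinate at a time."""
    s = s0.astype(float).copy()
    for _ in range(sweeps):
        for i in range(len(s)):
            # dW/ds_i = -2 (A s)_i + b_i; solve for the maximizing s_i
            # while holding the other coordinates fixed.
            rest = A[i] @ s - A[i, i] * s[i]
            s[i] = (b[i] - 2.0 * rest) / (2.0 * A[i, i])
    return s

A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
print(coordinate_ascent(A, b, np.zeros(2)))   # approaches the maximizer 0.5 * A^{-1} b

Each inner update moves parallel to one coordinate axis, exactly as in the description above.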

14 The SMO Methodology


To conclude our discussion of SVMs, we present the derivation of the SMO algorithm. The more nuanced details are left for further reading, in particular Platt's original paper.

The optimization problem that needs to be solved is the following dual form:
\max_{s} \ W(s) = \sum_{i=1}^{m} s_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} s_i s_j \langle x^{(i)}, x^{(j)} \rangle    (35)

subject to:

0 \le s_i \le C, \quad i = 1, \dots, m,    (36)

\sum_{i=1}^{m} s_i y^{(i)} = 0.    (37)

Suppose we have a set of s_i values that satisfy these constraints. If we attempt to maximize W by altering only s_1 while keeping s_2, \dots, s_m fixed, no progress can be made, because the equality constraint forces

s_1 y^{(1)} = -\sum_{i=2}^{m} s_i y^{(i)}.    (38)

Multiplying both sides by y^{(1)} gives

s_1 = -y^{(1)} \sum_{i=2}^{m} s_i y^{(i)}.    (39)

Since y^{(1)} is either +1 or -1, we have (y^{(1)})^2 = 1, so the remaining s_i values fully determine s_1. In other words, under this constraint s_1 cannot be changed independently. This observation motivates the SMO approach, which maintains constraint satisfaction by updating at least two variables at once. The method works as follows:

1. Select a pair of parameters, s_i and s_j, for updating, by employing a strategy that determines which pair is most likely to improve W(s).

2. Optimize W(s) with respect to s_i and s_j, while holding all other s_k constant (for k ≠ i, j).

To determine whether the algorithm has converged, we check whether the Karush-Kuhn-Tucker (KKT) conditions hold within a specified tolerance tol, typically between 0.001 and 0.01. Additional implementation details, including pseudocode, can be found in Platt's paper.
SMO's Efficient Updates

The key reason SMO is fast is that the update of s_i and s_j can be computed very cheaply, in closed form.
Suppose that the current set of s_i values satisfies the constraints (36)-(37), and that we decide to re-optimize W(s) with respect to s_1 and s_2. The equality constraint requires

s_1 y^{(1)} + s_2 y^{(2)} = -\sum_{i=3}^{m} s_i y^{(i)}.    (40)

Since the right-hand side is fixed during this update, we denote it by a constant \zeta:

s_1 y^{(1)} + s_2 y^{(2)} = \zeta.    (41)

This constraint restricts s_1 and s_2 to lie on a line within the feasible region defined by 0 \le s_i \le C. To ensure feasibility within the [0, C] \times [0, C] box, the permitted values of s_2 must also satisfy lower and upper limits L and H.

Rewriting s_1 as a function of s_2:

s_1 = (\zeta - s_2 y^{(2)}) y^{(1)},    (42)

the objective function W(s) can be expressed as

W(s_1, s_2, \dots, s_m) = W\big( (\zeta - s_2 y^{(2)}) y^{(1)}, s_2, \dots, s_m \big).    (43)

Since all other s_i values are held constant, this is a quadratic function of s_2, of the form

a s_2^2 + b s_2 + c    (44)

for some coefficients a, b, c.

Setting the derivative of this quadratic to zero yields the optimal s_2 when the box limits are ignored. If the resulting value falls outside [L, H], it is truncated (or "clipped") as follows:

s_2^{new} = \begin{cases} H, & \text{if } s_2^{unclipped} > H, \\ s_2^{unclipped}, & \text{if } L \le s_2^{unclipped} \le H, \\ L, & \text{if } s_2^{unclipped} < L. \end{cases}    (45)

Once s_2^{new} has been set, equation (42) is used to determine s_1^{new}.

Additional implementation details, such as the heuristics for selecting s_i and s_j and for updating the bias term b, are covered in Platt's paper. A simplified sketch of the pairwise update is given below.
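To make the update concrete, here is a hedged sketch of a single pairwise step, following the simplified SMO update often used in teaching material rather than Platt's full heuristics; the Gram matrix K, labels y, current values s, bias b, regularization parameter C, and the chosen pair (i, j) are all assumed to be given:

import numpy as np

def smo_pair_step(K, y, s, b, i, j, C):
    """One simplified SMO update of the pair (s_i, s_j).
    K is the kernel (Gram) matrix, y holds labels in {-1, +1}."""
    f = (s * y) @ K + b                   # current decision values for all points
    E_i, E_j = f[i] - y[i], f[j] - y[j]   # prediction errors

    # Box bounds [L, H] for s_j so that constraints (36) and (37) stay satisfied.
    if y[i] != y[j]:
        L, H = max(0.0, s[j] - s[i]), min(C, C + s[j] - s[i])
    else:
        L, H = max(0.0, s[i] + s[j] - C), min(C, s[i] + s[j])
    if L == H:
        return s, b

    eta = 2.0 * K[i, j] - K[i, i] - K[j, j]   # curvature along the constraint line
    if eta >= 0:
        return s, b                           # this sketch simply skips degenerate pairs

    s_i_old, s_j_old = s[i], s[j]
    s[j] = np.clip(s_j_old - y[j] * (E_i - E_j) / eta, L, H)   # unconstrained optimum, then clip as in (45)
    s[i] = s_i_old + y[i] * y[j] * (s_j_old - s[j])            # keep s_i y^(i) + s_j y^(j) constant, as in (42)

    # One common way to refresh the bias so the updated pair satisfies the KKT conditions.
    b1 = b - E_i - y[i] * (s[i] - s_i_old) * K[i, i] - y[j] * (s[j] - s_j_old) * K[i, j]
    b2 = b - E_j - y[i] * (s[i] - s_i_old) * K[i, j] - y[j] * (s[j] - s_j_old) * K[j, j]
    b = b1 if 0 < s[i] < C else (b2 if 0 < s[j] < C else 0.5 * (b1 + b2))
    return s, b

A full implementation would repeatedly choose pairs heuristically, apply this step, and stop once the KKT conditions hold within the tolerance tol, as described above.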

Conclusion
Support Vector Machines (SVMs) are among the best supervised learning
algorithms, particularly helpful for classification in high-dimensional spaces.
Their main goal is to identify the hyperplane that produces the strongest
generalization on new data by maximally separating classes.
SVMs use convex optimization to identify the optimal margin classifier, building on the ideas of functional and geometric margins. Through Lagrange duality, the optimization can be written entirely in terms of inner products, so kernel functions can be used to operate efficiently in high-dimensional feature spaces without explicitly computing the transformation; this is the kernel trick.
SVMs use a regularization parameter and slack variables to handle non-
linearly separable data in order to balance margin size and misclassification.
The dual problem of the SVM can be solved efficiently using the Sequential Minimal Optimization (SMO) algorithm.

References
• Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

• Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT).

• Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.

• Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.

• Platt, J. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research Technical Report MSR-TR-98-14.

• Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

• Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 1–27.

• Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.

• Smola, A. J., & Schölkopf, B. (2004). A Tutorial on Support Vector Regression. Statistics and Computing, 14(3), 199–222.

• Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.
