
Machine Learning

Support Vector Machine

Lecturer: Duc Dung Nguyen, PhD.


Contact: [email protected]

Faculty of Computer Science and Engineering


Ho Chi Minh City University of Technology
Contents

1. Analytical Geometry

2. Maximum Margin Classifiers

3. Lagrange Multipliers

4. Non-linearly Separable Data

5. Soft-margin



Analytical Geometry
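
Editorial note: the slides in this section of the original deck are figure-only. The geometric fact they presumably illustrate, and which the margin formulas below rely on, is the distance from a point x_0 to the hyperplane w.x + b = 0:

\[
\operatorname{dist}\left(x_0,\ \{x : w \cdot x + b = 0\}\right) = \frac{|w \cdot x_0 + b|}{\lVert w \rVert},
\]

since w/||w|| is the unit normal vector of the hyperplane.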


Maximum Margin Classifiers
Maximum margin classifiers

• Assume that the data are linearly separable


• Decision boundary equation:
y(x) = w.x + b
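
A minimal sketch of this decision rule in code (an editorial addition, not from the slides; the weight vector and bias below are hypothetical): the sign of y(x) gives the predicted class.

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = 0.5                     # hypothetical bias

def y(x):
    # Decision boundary equation: y(x) = w.x + b
    return np.dot(w, x) + b

print(y(np.array([1.0, 1.0])))    #  1.5 -> predicted class +1
print(y(np.array([-1.0, 1.0])))   # -2.5 -> predicted class -1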



Maximum margin classifiers

• Margin: the smallest distance between the decision boundary and any of the samples.



Maximum margin classifiers

• Support vectors: samples at the two margins.



Maximum margin classifiers

• Scale w and b so that y(x_n) = +1 or −1 at the support vectors:





Maximum margin classifiers

• Signed distance between the decision boundary and a sample x_n:

    y(x_n) / ||w||

• Absolute distance between the decision boundary and a sample x_n:

    t_n.y(x_n) / ||w||

  where t_n = +1 iff y(x_n) > 0 and t_n = −1 iff y(x_n) < 0



Maximum margin classifiers

• Maximum margin:

    arg max_{w,b} { (1/||w||) . min_n [ t_n.(w.x_n + b) ] }

  with the constraint:

    t_n.(w.x_n + b) ≥ 1



Maximum margin classifiers

• To be optimized:

    arg min_{w,b} (1/2).||w||^2

  with the constraint:

    t_n.(w.x_n + b) ≥ 1
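
An implicit step between the previous slide and this one: rescaling (w, b) by a positive constant does not change the decision boundary, so the scale can be fixed by requiring min_n t_n.(w.x_n + b) = 1 (the support-vector scaling above). A sketch of the resulting equivalence:

\[
\arg\max_{w,b}\ \frac{1}{\lVert w\rVert}\min_n t_n (w\cdot x_n + b)
\;=\; \arg\max_{w,b}\ \frac{1}{\lVert w\rVert}
\;=\; \arg\min_{w,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad \text{subject to } t_n (w\cdot x_n + b) \ge 1 .
\]

The factor 1/2 and the square do not change the minimizer; they simply turn the problem into a convenient quadratic program.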



Lagrange Multipliers
Optimization using Lagrange multipliers

Joseph-Louis Lagrange (born 25 January 1736; died in Paris, 10 April 1813), also reported as Giuseppe Luigi Lagrange, was an Italian Enlightenment Era mathematician and astronomer. He made significant contributions to the fields of analysis, number theory, and both classical and celestial mechanics.



Optimization using Lagrange multipliers

• Problem:

    arg max_x f(x)

  with the constraint:

    g(x) = 0



Optimization using Lagrange multipliers

• Solution is the stationary point of the Lagrange function:

    L(x, λ) = f(x) + λ.g(x)

  such that:

    ∂L(x, λ)/∂x_n = ∂f(x)/∂x_n + λ.∂g(x)/∂x_n = 0

  and

    ∂L(x, λ)/∂λ = g(x) = 0



Optimization using Lagrange multipliers

• Example:

    f(x) = 1 − u^2 − v^2

  with the constraint:

    g(x) = u + v − 1 = 0



Optimization using Lagrange multipliers

• Lagrange function:

    L(x, λ) = f(x) + λ.g(x) = (1 − u^2 − v^2) + λ.(u + v − 1)

    ∂L(x, λ)/∂u = ∂f(x)/∂u + λ.∂g(x)/∂u = −2u + λ = 0
    ∂L(x, λ)/∂v = ∂f(x)/∂v + λ.∂g(x)/∂v = −2v + λ = 0
    ∂L(x, λ)/∂λ = g(x) = u + v − 1 = 0

• Solution: u = 1/2 and v = 1/2
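
A quick symbolic check of this example (an editorial sketch using SymPy, not part of the original slides):

import sympy as sp

u, v, lam = sp.symbols('u v lambda', real=True)
L = (1 - u**2 - v**2) + lam * (u + v - 1)   # Lagrange function of the example

# Stationary point: all partial derivatives of L vanish.
stationary = sp.solve([sp.diff(L, u), sp.diff(L, v), sp.diff(L, lam)], [u, v, lam])
print(stationary)   # expect u = 1/2, v = 1/2 (and lambda = 1)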





Optimization using Lagrange multipliers

• Problem:

    arg max_x f(x)

  with the inequality constraint:

    g(x) ≥ 0



Optimization using Lagrange multipliers

• Solution is the stationary point of the Lagrange function:

    L(x, λ) = f(x) + λ.g(x)

  such that:

    ∂L(x, λ)/∂x_n = ∂f(x)/∂x_n + λ.∂g(x)/∂x_n = 0

  and

    g(x) ≥ 0
    λ ≥ 0
    λ.g(x) = 0
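
These are the Karush–Kuhn–Tucker (KKT) conditions. A short sketch of why the last condition, λ.g(x) = 0, appears: at the solution the constraint is either inactive or active,

\[
\text{either } g(x) > 0 \text{ and } \lambda = 0 \ \ (\text{constraint inactive, } L \text{ reduces to } f),
\quad \text{or } g(x) = 0 \text{ and } \lambda \ge 0 \ \ (\text{solution on the constraint boundary}),
\]

and in both cases the product λ.g(x) vanishes.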



Optimization using Lagrange multipliers

• To be optimized:

    arg min_{w,b} (1/2).||w||^2

  with the constraint:

    t_n.(w.x_n + b) ≥ 1

• Lagrange function for maximum margin classifier:

    L(w, b, a) = (1/2).||w||^2 − Σ_{n=1..N} a_n.(t_n.(w.x_n + b) − 1)

  with the conditions:

    t_n.(w.x_n + b) − 1 ≥ 0
    a_n ≥ 0
    a_n.(t_n.(w.x_n + b) − 1) = 0
Optimization using Lagrange multipliers

• Lagrange function for maximum margin classifier:

    L(w, b, a) = (1/2).||w||^2 − Σ_{n=1..N} a_n.(t_n.(w.x_n + b) − 1)

• Solution for w:

    ∂L(w, b, a)/∂w = 0   ⇒   w = Σ_{n=1..N} a_n.t_n.x_n

    ∂L(w, b, a)/∂b = Σ_{n=1..N} a_n.t_n = 0



Optimization using Lagrange multipliers

• Lagrange function for maximum margin classifier:

    L(w, b, a) = (1/2).||w||^2 − Σ_{n=1..N} a_n.(t_n.(w.x_n + b) − 1)

• Solution for a: dual representation to be optimized

    L*(a) = Σ_{n=1..N} a_n − (1/2) Σ_{n=1..N} Σ_{m=1..N} a_n.a_m.t_n.t_m.(x_n.x_m)

  with the constraints:

    a_n ≥ 0
    Σ_{n=1..N} a_n.t_n = 0
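
A sketch of where this dual comes from: expand L(w, b, a) and substitute the stationarity conditions from the previous slide.

\[
L(w,b,a) = \tfrac{1}{2}\lVert w\rVert^{2} - \sum_n a_n t_n (w\cdot x_n) - b\sum_n a_n t_n + \sum_n a_n .
\]

With w = Σ_{n} a_n.t_n.x_n one has ||w||^2 = Σ_n Σ_m a_n.a_m.t_n.t_m.(x_n.x_m) = Σ_n a_n.t_n.(w.x_n), and Σ_n a_n.t_n = 0 removes the term involving b, leaving exactly L*(a) above.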



Optimization using Lagrange multipliers

• Lagrange function for maximum margin classifier:

    L(w, b, a) = (1/2).||w||^2 − Σ_{n=1..N} a_n.(t_n.(w.x_n + b) − 1)

• Solution for a: dual representation to be optimized

    L*(a) = Σ_{n=1..N} a_n − (1/2) Σ_{n=1..N} Σ_{m=1..N} a_n.a_m.t_n.t_m.(x_n.x_m)

Why optimization via dual representation?


• Sparsity: a_n = 0 if x_n is not a support vector.
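
A small illustration of this sparsity (an editorial sketch using scikit-learn, not part of the slides): after fitting, only the support vectors carry non-zero multipliers.

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (hypothetical example).
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
t = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6)   # a very large C approximates the hard margin
clf.fit(X, t)

print(clf.support_vectors_)   # only a few of the six training points
print(clf.dual_coef_)         # a_n.t_n for the support vectors; all other a_n are zero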



Optimization using Lagrange multipliers

• Lagrange function for maximum margin classifier:

    L(w, b, a) = (1/2).||w||^2 − Σ_{n=1..N} a_n.(t_n.(w.x_n + b) − 1)

    a_n.(t_n.(w.x_n + b) − 1) = 0

• Solution for b:

    b = (1/|S|) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m.t_m.(x_m.x_n) )

  where S is the set of support vectors (a_n ≠ 0)
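
A sketch of where this expression for b comes from, using the condition a_n.(t_n.(w.x_n + b) − 1) = 0 above: for any support vector x_n we have t_n.(w.x_n + b) = 1; multiplying by t_n (with t_n^2 = 1) and substituting w = Σ_{m∈S} a_m.t_m.x_m gives

\[
b = t_n - \sum_{m\in S} a_m t_m (x_m\cdot x_n),
\]

and averaging this over all support vectors (numerically more stable than using a single one) gives the formula above.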



Optimization using Lagrange multipliers

• Classification:

    y(x) = w.x + b = Σ_{n=1..N} a_n.t_n.(x_n.x) + b

    y(x) > 0  →  +1
    y(x) < 0  →  −1
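
A minimal sketch of this dual-form classification rule in code (an editorial addition; the support vectors, targets, multipliers and bias below are hypothetical and would normally come from solving the dual problem). Only points with a_n ≠ 0 contribute.

import numpy as np

X_sv = np.array([[1.0, 1.0], [2.0, 2.0]])   # hypothetical support vectors
t_sv = np.array([-1.0, 1.0])                # their targets
a_sv = np.array([1.0, 1.0])                 # their Lagrange multipliers
b = -3.0                                    # hypothetical bias

def classify(x):
    y = np.sum(a_sv * t_sv * (X_sv @ x)) + b   # y(x) = sum_n a_n.t_n.(x_n.x) + b
    return +1 if y > 0 else -1

print(classify(np.array([3.0, 3.0])))   # +1
print(classify(np.array([0.0, 0.0])))   # -1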



Non-linearly Separable Data
Kernel trick for non-linearly separable data

• Mapping the data points into a high dimensional feature space.


• Example 1:
• Original space: (x)
• New space: (x, x^2)



Kernel trick for non-linearly separable data

• Example 2:
• Original space: (u, v)
• New space: ((u^2 + v^2)^{1/2}, arctan(v/u))



Kernel trick for non-linearly separable data

Example 3: XOR function

In1 In2 t
0 0 0
0 1 1
1 0 1
1 1 0



Kernel trick for non-linearly separable data

Example 3: XOR function

In1 In2 In3 Output


0 0 1 1
0 1 0 0
1 0 0 0
1 1 0 1





Kernel trick for non-linearly separable data

• Classification in the new space:

    y(x) = w.φ(x) + b = Σ_{n=1..N} a_n.t_n.(φ(x_n).φ(x)) + b

• Computational complexity of φ(x_n).φ(x) is high due to the high dimension of φ(.).

• Kernel trick:

    φ(x_n).φ(x_m) = K(x_n, x_m)
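
With this substitution, neither the dual problem nor prediction needs φ explicitly. As a note following directly from the classification rule above, the decision function becomes

\[
y(x) = \sum_{n=1..N} a_n t_n\, K(x_n, x) + b .
\]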



Kernel trick for non-linearly separable data

• A typical kernel function:

    K(u, v) = (1 + u.v)^2

    φ((u_1, u_2, ..., u_d)) = (1, √2.u_1, √2.u_2, ..., √2.u_d,
                               √2.u_1.u_2, √2.u_1.u_3, ..., √2.u_{d−1}.u_d,
                               u_1^2, u_2^2, ..., u_d^2)

    φ(u).φ(v) = 1 + 2 Σ_{i=1..d} u_i.v_i + 2 Σ_{i=1..d−1} Σ_{j=i+1..d} u_i.v_i.u_j.v_j + Σ_{i=1..d} u_i^2.v_i^2

    φ(u).φ(v) = K(u, v)     (checked numerically in the sketch after this list)

• Is φ(x) guaranteed to be separable?
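
A small numeric check (an editorial sketch, not part of the slides) that the explicit feature map above reproduces K(u, v) = (1 + u.v)^2:

import numpy as np

def phi(x):
    # Explicit degree-2 polynomial feature map for a d-dimensional input x.
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate(([1.0], np.sqrt(2) * x, cross, x ** 2))

rng = np.random.default_rng(0)
u, v = rng.normal(size=3), rng.normal(size=3)

print(phi(u) @ phi(v))       # dot product in the feature space
print((1.0 + u @ v) ** 2)    # kernel evaluated in the original space -> same value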



Soft-margin
Soft margin SVM

• Soft-margin SVM: allows some of the training samples to be misclassified.


• Slack variable: ξ





Soft margin SVM

• New constraints:

    t_n.(w.x_n + b) ≥ 1 − ξ_n
    ξ_n ≥ 0

• To be minimized:

    (1/2).||w||^2 + C Σ_{n=1..N} ξ_n

  where C > 0 controls the trade-off between the margin and the slack variable penalty
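
A brief illustration (an editorial sketch using scikit-learn; the data are hypothetical) of how C controls this trade-off: a small C tolerates more margin violations, while a large C approaches the hard-margin behaviour.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping clusters.
X, t = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, t)
    # A smaller C typically allows more slack, hence more support vectors.
    print(C, len(clf.support_))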



Summary

• SVM is a sparse kernel method.


• Soft-margin SVM deals with data that remain non-linearly separable even after the kernel mapping.

