SVM-ML-AI_lecturenotes_cs725
Dual Formulation
\[
d^* = \max_{\lambda \in \mathbb{R}^m} \; \min_{x \in D} \left( f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) \right) \tag{53}
\]
\[
\text{s.t. } \lambda_i \geq 0, \quad i = 1, \dots, m \tag{54}
\]
Equation 53 is a convex optimization problem. Also, d∗ ≤ p∗, and (p∗ − d∗) is called the duality gap.
If for some (x∗, λ∗), where x∗ is primal feasible and λ∗ is dual feasible, the KKT conditions are satisfied and f and all gi are convex, then x∗ is an optimal solution to the primal and λ∗ to the dual.
Also, the dual optimization problem becomes a maximization over λ alone (the inner minimization over x defines the dual objective). It happens that
\[
p^* = d^* \tag{61}
\]
KKT 1.a
\[
\nabla_w L = 0 \tag{63}
\]
\[
\implies w - \sum_{j=1}^{n} \alpha_j y_j \phi(x_j) = 0 \tag{64}
\]
KKT 1.b
\[
\nabla_{\xi_i} L = 0 \tag{65}
\]
\[
\implies c - \alpha_i - \lambda_i = 0 \tag{66}
\]
KKT 1.c
\[
\nabla_{w_0} L = 0 \tag{67}
\]
\[
\implies \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{68}
\]
KKT 2
For all i: (69)
\[
y_i(\phi^T(x_i) w + w_0) \geq 1 - \xi_i \tag{70}
\]
\[
\xi_i \geq 0 \tag{71}
\]
KKT 3
\[
\alpha_j \geq 0 \;\text{ and }\; \lambda_k \geq 0, \qquad \forall j, k = 1, \dots, n \tag{72--73}
\]
KKT 4
\[
\alpha_j \left( y_j(\phi^T(x_j) w + w_0) - 1 + \xi_j \right) = 0 \tag{74}
\]
\[
\lambda_k \xi_k = 0 \tag{75}
\]
(a)
\[
w^* = \sum_{j=1}^{m} \alpha_j y_j \phi(x_j) \tag{76}
\]
(b)
subject to the constraint,
\[
\forall i: \; y_i(\phi^T(x_i) w + w_0) \geq 1 - \xi_i
\]
The dual of the SVM optimization problem can be stated as,
\[
\max_{\alpha} \left\{ -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \, \phi^T(x_i)\phi(x_j) + \sum_{j=1}^{m} \alpha_j \right\}
\]
subject to the constraints,
\[
\sum_{i} \alpha_i y_i = 0
\]
\[
\forall i: \; 0 \leq \alpha_i \leq c
\]
The duality gap = f(x∗) − L∗(λ∗) = 0, as shown in the last lecture. Thus, as is evident from the solution of the dual problem,
\[
w^* = \sum_{i=1}^{m} \alpha_i^* y_i \phi(x_i)
\]
To obtain w0∗, we can use the fact (as shown in the last lecture) that if αi ∈ (0, C), then yi(φT(xi)w + w0) = 1. Thus, for any point xi such that αi ∈ (0, C), that is, any xi lying on the margin (and noting that yi ∈ {−1, +1} implies 1/yi = yi),
\[
\begin{aligned}
w_0^* &= \frac{1 - y_i \, \phi^T(x_i) w^*}{y_i} \\
      &= y_i - \phi^T(x_i) w^* \\
      &= y_i - \sum_{j=1}^{m} \alpha_j^* y_j \, \phi^T(x_i)\phi(x_j) \\
      &= y_i - \sum_{j=1}^{m} \alpha_j^* y_j K_{ij}
\end{aligned}
\]
Generation of φ space
For a given x = [x_1, x_2, \dots, x_n] → φ(x) = [x_1^d, x_2^d, x_3^d, \dots, x_1^{d-1}x_2, \dots].
For n = 2, d = 2, φ(x) = [x_1^2, x_1 x_2, x_2 x_1, x_2^2]; thus,
\[
\begin{aligned}
\phi^T(x)\,\phi(\bar{x}) &= \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j \, \bar{x}_i \bar{x}_j \\
&= \Big(\sum_{i=1}^{n} x_i \bar{x}_i\Big)\Big(\sum_{j=1}^{n} x_j \bar{x}_j\Big) \\
&= \Big(\sum_{i=1}^{n} x_i \bar{x}_i\Big)^2 \\
&= (x^T \bar{x})^2
\end{aligned}
\]
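This identity can be checked numerically. The following is a small illustrative sketch (the helper name phi and the test vectors are assumptions, not from the notes) comparing the explicit feature map for n = 2, d = 2 against the kernel (x^T x̄)^2:

```python
import numpy as np

def phi(x):
    # Explicit feature map for n = 2, d = 2: [x1^2, x1*x2, x2*x1, x2^2]
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])

x = np.array([1.0, 2.0])
x_bar = np.array([3.0, -1.0])

lhs = phi(x) @ phi(x_bar)   # dot product in the phi space
rhs = (x @ x_bar) ** 2      # kernel evaluated in the input space
print(lhs, rhs)             # both print 1.0
```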
3. Positivity of Diagonal
\[
K = V \Lambda V^T
\]
where V is the eigenvector matrix (an orthogonal matrix), and Λ is the diagonal matrix of eigenvalues.
Hence K must:
1. Be symmetric.
2. Be positive semi-definite.
3. Have non-negative diagonal entries.
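As an illustration of these checks, here is a minimal NumPy sketch (the function name and tolerance are assumptions for the example) that tests whether a given Gram matrix satisfies the three conditions:

```python
import numpy as np

def is_valid_gram_matrix(K, tol=1e-10):
    """Check symmetry, positive semi-definiteness, and non-negative diagonal."""
    symmetric = np.allclose(K, K.T)
    # Eigenvalues of a symmetric matrix are real; PSD means all are >= 0.
    eigenvalues = np.linalg.eigvalsh(K)
    psd = np.all(eigenvalues >= -tol)
    nonneg_diag = np.all(np.diag(K) >= -tol)
    return symmetric and psd and nonneg_diag
```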
Examples of Kernels
1. K_{ij} = (x_i^T x_j)^d
2. K_{ij} = (x_i^T x_j + 1)^d
3. Gaussian or Radial Basis Function (RBF):
\[
K_{ij} = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}} \qquad (\sigma \in \mathbb{R}, \; \sigma \neq 0)
\]
4. The hyperbolic tangent function:
\[
K_{ij} = \tanh(\sigma x_i^T x_j + c)
\]
Define φ(x_i) = φ′^T(x′_i)\,φ″(x″_i). Thus, K_{ij} = φ(x_i)φ(x_j). Hence, K is a valid kernel.
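For illustration, the example kernels above can be evaluated directly; in the following sketch the parameter values (d, sigma, c) are arbitrary choices for the example, not prescribed by the notes:

```python
import numpy as np

def poly_kernel(x, z, d):
    return (x @ z) ** d

def poly_kernel_with_offset(x, z, d):
    return (x @ z + 1.0) ** d

def rbf_kernel(x, z, sigma):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

def tanh_kernel(x, z, sigma, c):
    # The tanh kernel is positive semi-definite only for some parameter choices.
    return np.tanh(sigma * (x @ z) + c)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly_kernel(x, z, d=2), rbf_kernel(x, z, sigma=1.0))
```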
\[
\begin{aligned}
\min_{\alpha_1, \alpha_2} \; & -\alpha_1 - \alpha_2 - \sum_{i \neq 1,2} \alpha_i
+ \frac{1}{2}\left(\alpha_1^2 K_{11} + \alpha_2^2 K_{22}\right) + \alpha_1 \alpha_2 K_{12} y_1 y_2 \\
& + \alpha_1 y_1 \sum_{i \neq 1,2} K_{1i} \alpha_i y_i
  + \alpha_2 y_2 \sum_{i \neq 1,2} K_{2i} \alpha_i y_i
\end{aligned} \tag{77}
\]
\[
\text{s.t. } \alpha_1 y_1 + \alpha_2 y_2 = -\sum_{j \neq 1,2} \alpha_j y_j = \alpha_1^{old} y_1 + \alpha_2^{old} y_2
\]
\[
\alpha_1, \alpha_2 \in [0, c]
\]
Using the equality constraint, α1 can be written in terms of α2, so the objective is just a function of α2; call it −D(α2). Now the program reduces to
\[
\min_{\alpha_2} \; -D(\alpha_2)
\]
\[
\text{s.t. } \alpha_2 \in [0, c]
\]
5. Case 1: y1 = y2 ; Case 2: y1 = −y2 .
In each case the equality constraint, together with the box constraint, restricts α2 to an interval. If α2 is already in the interval then there is no problem. If it exceeds the upper limit then reset it to the upper limit; this ensures the optimum value of the objective under this condition. Similarly, if α2 falls below the lower limit then reset it to the lower limit (see the clipping sketch after this item).
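Here is a minimal sketch of this clipping step; the interval endpoints used below are the standard SMO bounds for the box constraint [0, c], stated here as an assumption since the notes do not write them out explicitly:

```python
def clip_alpha2(alpha2_new, alpha1_old, alpha2_old, y1, y2, c):
    """Clip the unconstrained optimum alpha2_new back into its feasible interval."""
    if y1 == y2:
        # Case 1: alpha1 + alpha2 is conserved by the equality constraint.
        low = max(0.0, alpha1_old + alpha2_old - c)
        high = min(c, alpha1_old + alpha2_old)
    else:
        # Case 2: alpha2 - alpha1 is conserved by the equality constraint.
        low = max(0.0, alpha2_old - alpha1_old)
        high = min(c, c + alpha2_old - alpha1_old)
    return min(max(alpha2_new, low), high)
```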
Dual:
\[
\min_{\alpha} \; -\sum_{i} \alpha_i + \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j K_{ij} \tag{78}
\]
\[
\text{s.t. } \sum_{i} \alpha_i y_i = 0
\]
\[
\alpha_i \in [0, c]
\]
The above program is a quadratic program. Any quadratic programming solver can be used for solving (78), but a generic solver does not exploit the special structure of the problem and may not be efficient. One way to solve (78) is by using projection methods (also called the kernel adatron). The problem can also be solved in two ways: chunking methods and decomposition methods.
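As an illustration of using a generic solver, the following sketch maps (78) onto the cvxopt QP interface, which solves min ½xᵀPx + qᵀx subject to Gx ≤ h, Ax = b; the use of cvxopt and the variable names are assumptions for the example, not part of the notes:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_svm_dual(K, y, c):
    """Solve the SVM dual (78) with a generic QP solver (ignores problem structure)."""
    m = len(y)
    y = y.astype(float)
    P = matrix(np.outer(y, y) * K)                    # quadratic term: y_i y_j K_ij
    q = matrix(-np.ones(m))                           # linear term: -sum_i alpha_i
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))    # encodes 0 <= alpha_i <= c
    h = matrix(np.hstack([np.zeros(m), c * np.ones(m)]))
    A = matrix(y.reshape(1, -1))                      # equality: sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()
```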
The chunking method is as follows:
1. Initialize the αi arbitrarily.
2. Choose the points (that is, the components αi) that violate the KKT conditions.
3. Consider only this working set and solve the dual for the variables in the working set:
For all α in the working set,
\[
\min_{\alpha} \; -\sum_{i \in WS} \alpha_i + \frac{1}{2} \sum_{i \in WS} \sum_{j \in WS} \alpha_i \alpha_j y_i y_j K_{ij} \tag{79}
\]
\[
\text{s.t. } \sum_{i \in WS} \alpha_i y_i = -\sum_{j \notin WS} \alpha_j y_j
\]
\[
\alpha_i \in [0, c]
\]
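A high-level skeleton of this procedure might look as follows; the helpers violates_kkt and solve_subproblem stand in for steps 2 and 3 above and are placeholders, not functions from the notes:

```python
import numpy as np

def chunking(K, y, c, solve_subproblem, violates_kkt, max_iter=100):
    """Skeleton of the chunking procedure: repeatedly re-optimize the
    dual over a working set of KKT-violating components."""
    m = len(y)
    alpha = np.zeros(m)                       # step 1: initialize (here: zeros)
    for _ in range(max_iter):
        ws = [i for i in range(m) if violates_kkt(i, alpha, K, y, c)]  # step 2
        if not ws:
            break                             # all KKT conditions satisfied
        # Step 3: solve (79) over the working set, keeping the other alphas fixed.
        alpha[ws] = solve_subproblem(ws, alpha, K, y, c)
    return alpha
```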
Decomposition methods follow almost the same procedure, except that in step 2 we always take a fixed number of points that violate the KKT conditions the most.
Further Reading
For SVMs in general and kernel methods in particular, read the SVM book An Introduction to Support Vector Machines and Other Kernel-based Learning Methods by Nello Cristianini and John Shawe-Taylor, uploaded on Moodle.