
CS 725 : Foundations of Machine Learning Autumn 2011

Lecture 27: SVM: Dual Formulation, Notion of Kernel


Instructor: Ganesh Ramakrishnan Date: 05/11/2011
Computer Science & Engineering Indian Institute of Technology, Bombay

21.2 SVM : Dual Formulation


Primal formulation

    p∗ = min_{x ∈ D} f(x)                                        (49)
    s.t. gi(x) ≤ 0,   i = 1, . . . , m                           (50)

Dual Formulation

    d∗ = max_{λ ∈ R^m}  min_{x ∈ D}  ( f(x) + Σ_{i=1}^{m} λi gi(x) )        (53)
    s.t. λi ≥ 0,   i = 1, . . . , m                                         (54)

Equation 53 is a convex optimization problem: the inner minimum over x is a pointwise minimum of functions that are affine in λ, hence concave in λ, so the outer problem maximizes a concave function. Also, d∗ ≤ p∗, and (p∗ − d∗) is called the duality gap.
If for some (x∗, λ∗), where x∗ is primal feasible and λ∗ is dual feasible, the KKT conditions are satisfied and f and all gi are convex, then x∗ is an optimal solution to the primal and λ∗ to the dual.
Also, the dual optimization problem becomes,

    d∗ = max_{λ ∈ R^m}  L∗(λ)                                               (55)
    s.t. λi ≥ 0  ∀i                                                         (56)

    where  L(x, λ) = f(x) + Σ_{i=1}^{m} λi gi(x)                            (57)

           L∗(λ) = min_{x ∈ D} L(x, λ)                                      (58)
                 = min_{x satisfying the KKT conditions} L(x, λ),   λi ≥ 0 ∀i    (59)

In this case, it turns out that strong duality holds:

    p∗ = d∗                                                      (61)


21.3 Duality theory applied to KKT


    L(w̄, ξ̄, w0, ᾱ, λ̄) = (1/2)||w||^2 + c Σ_{i=1}^{m} ξi
                          + Σ_{i=1}^{m} αi ( 1 − ξi − yi (φT(xi) w + w0) )
                          − Σ_{i=1}^{m} λi ξi                               (62)

Now we check the KKT conditions at the point of optimality.

KKT 1.a

    ∇w L = 0                                                      (63)
    =⇒  w − Σ_{j=1}^{n} αj yj φ(xj) = 0                           (64)

KKT 1.b

    ∇ξi L = 0                                                     (65)
    =⇒  c − αi − λi = 0                                           (66)

KKT 1.c

    ∇w0 L = 0                                                     (67)
    =⇒  Σ_{i=1}^{n} αi yi = 0                                     (68)

KKT 2

    ∀i:  yi (φT(xi) w + w0) ≥ 1 − ξi                              (70)
         ξi ≥ 0                                                   (71)

KKT 3

    αj ≥ 0  and  λk ≥ 0,   ∀ j, k = 1, . . . , n                  (72)

KKT 4

    αj ( yj (φT(xj) w + w0) − 1 + ξj ) = 0                        (74)
    λk ξk = 0                                                     (75)

(a)

    w∗ = Σ_{j=1}^{m} αj yj φ(xj)                                  (76)

Thus w∗ is a weighted linear combination of the mapped points φ(xj).

(b)

If 0 < αj < c then, by Equation 66, 0 < λj < c; by Equation 75, ξj = 0; and so yj (φT(xj) w + w0) = 1, i.e. xj lies exactly on the margin.

If, however, αj = c then λj = 0, ξj may be positive, and yj (φT(xj) w + w0) ≤ 1, i.e. xj lies on or inside the margin (and is misclassified if ξj > 1).

If αj = 0 then λj = c, hence ξj = 0, and we get yj (φT(xj) w + w0) ≥ 1, i.e. xj lies on or outside the margin and does not contribute to w∗.
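This case analysis can be applied directly to a solved α vector. The sketch below (a numpy illustration with hypothetical variable names; a numerical tolerance replaces exact comparisons) separates the non-support vectors, the margin support vectors, and the bound support vectors:

    import numpy as np

    def categorize_alphas(alpha, c, tol=1e-6):
        # Solvers return approximate alphas, so compare up to a tolerance.
        alpha = np.asarray(alpha)
        non_sv    = np.where(alpha <= tol)[0]                        # alpha_j = 0
        margin_sv = np.where((alpha > tol) & (alpha < c - tol))[0]   # 0 < alpha_j < c
        bound_sv  = np.where(alpha >= c - tol)[0]                    # alpha_j = c
        return non_sv, margin_sv, bound_sv

    # e.g. categorize_alphas([0.0, 0.3, 1.0], c=1.0) puts index 0 among the
    # non-support vectors, index 1 on the margin, and index 2 at the bound.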

21.4 SVM dual


SVM can be formulated as the following optimization problem,
    min_{w, w0, ξ}  { (1/2)||w||^2 + C Σ_{i=1}^{m} ξi }

subject to the constraints,

    ∀i :  yi (φT(xi) w + w0) ≥ 1 − ξi,    ξi ≥ 0
The dual of the SVM optimization problem can be stated as,
    max_{α} { − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} yi yj αi αj φT(xi) φ(xj) + Σ_{j=1}^{m} αj }

subject to the constraints,

    Σ_i αi yi = 0
    ∀i :  0 ≤ αi ≤ c
The duality gap = f(x∗) − L∗(λ∗) = 0, as shown in the last lecture. Thus, as is evident from the solution of the dual problem,

    w∗ = Σ_{i=1}^{m} αi∗ yi φ(xi)

To obtain w0∗, we can use the fact (shown in the last lecture) that if αi ∈ (0, C), then yi (φT(xi) w + w0) = 1. Thus, for any point xi with αi ∈ (0, C), that is, any xi lying on the margin,

    w0∗ = ( 1 − yi φT(xi) w∗ ) / yi
        = yi − φT(xi) w∗              (since yi ∈ {−1, +1}, so 1/yi = yi)

The decision function,


    g(x) = φT(x) w∗ + w0∗
         = Σ_{i=1}^{m} αi yi φT(x) φ(xi) + w0∗
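This decision function translates directly into code. The sketch below is a minimal numpy illustration assuming the dual solution (alpha, w0) and a kernel function k(·, ·) are already in hand; the names are our own. Only the support vectors (αi > 0) contribute to the sum.

    import numpy as np

    def decision_function(x, X_train, y_train, alpha, w0, kernel):
        # g(x) = sum_i alpha_i y_i k(x, x_i) + w0; terms with alpha_i = 0 vanish,
        # so only the support vectors are evaluated.
        sv = alpha > 1e-8
        k_vals = np.array([kernel(x, xi) for xi in X_train[sv]])
        return np.sum(alpha[sv] * y_train[sv] * k_vals) + w0

    # The predicted label is the sign of g(x), e.g. with a linear kernel:
    # y_hat = np.sign(decision_function(x_new, X, y, alpha, w0, lambda a, b: a @ b))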

21.5 Kernel Matrix


The kernel matrix is

    K = [ φT(x1)φ(x1)   φT(x1)φ(x2)   . . .   φT(x1)φ(xn) ]
        [ φT(x2)φ(x1)   φT(x2)φ(x2)   . . .   φT(x2)φ(xn) ]
        [     . . .         . . .     . . .       . . .   ]
        [ φT(xn)φ(x1)   φT(xn)φ(x2)   . . .   φT(xn)φ(xn) ]

In other words, Kij = φT(xi) φ(xj). The SVM dual can now be re-written as,
    max_{α} { − (1/2) αT Ky α + αT 1 }

where (Ky)ij = yi yj Kij and 1 = ones(m, 1), subject to the constraints,

    Σ_i αi yi = 0
    0 ≤ αi ≤ c

Thus, for any i with αi ∈ (0, C),

    w0∗ = yi − φT(xi) w∗
        = yi − Σ_{j=1}^{m} αj∗ yj φT(xi) φ(xj)
        = yi − Σ_{j=1}^{m} αj∗ yj Kij
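Note that this last expression needs only the kernel matrix, never φ itself. Below is a minimal numpy sketch (function names are our own, and at least one margin point is assumed to exist) that builds K from a kernel function and recovers w0∗ from one such point. In practice the estimate is often averaged over all indices with αi ∈ (0, C) to reduce numerical error.

    import numpy as np

    def kernel_matrix(X, kernel):
        # K_ij = k(x_i, x_j) for the training points (rows of X).
        m = X.shape[0]
        K = np.empty((m, m))
        for i in range(m):
            for j in range(m):
                K[i, j] = kernel(X[i], X[j])
        return K

    def bias_from_margin_point(K, y, alpha, c, tol=1e-6):
        # w0* = y_i - sum_j alpha_j y_j K_ij for any i with 0 < alpha_i < c.
        i = int(np.where((alpha > tol) & (alpha < c - tol))[0][0])
        return y[i] - np.sum(alpha * y * K[i, :])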

Generation of φ space
For a given x = [x1, x2, . . . , xn], the degree-d feature map is φ(x) = [x1^d, x2^d, x3^d, . . . , x1^{d−1} x2, . . . ], containing all ordered monomials of degree d.
For n = 2, d = 2, φ(x) = [x1^2, x1 x2, x2 x1, x2^2], thus

    φT(x) · φ(x̄) = Σ_{i=1}^{n} Σ_{j=1}^{n} xi xj x̄i x̄j
                  = ( Σ_{i=1}^{n} xi x̄i ) ( Σ_{j=1}^{n} xj x̄j )
                  = ( Σ_{i=1}^{n} xi x̄i )^2
                  = (xT x̄)^2

In general, for n ≥ 1 and d ≥ 1, φT(x) · φ(x̄) = (xT x̄)^d.

A polynomial kernel, in general, is defined as Kij = (xiT xj)^d.
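This identity for n = 2, d = 2 is easy to verify numerically with the explicit map φ(x) = [x1^2, x1 x2, x2 x1, x2^2]; the test values below are arbitrary:

    import numpy as np

    def phi(x):
        # Explicit feature map for n = 2, d = 2: [x1^2, x1*x2, x2*x1, x2^2]
        return np.array([x[0]**2, x[0]*x[1], x[1]*x[0], x[1]**2])

    x, xbar = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    lhs = phi(x) @ phi(xbar)      # inner product computed in feature space
    rhs = (x @ xbar) ** 2         # kernel evaluated directly in input space
    assert np.isclose(lhs, rhs)   # both equal (1*3 + 2*(-1))^2 = 1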
CS 725 : Foundations of Machine Learning Autumn 2011

Lecture 28: SVM: Kernel Methods, Algorithms for solving Dual


Instructor: Ganesh Ramakrishnan Date: 07/11/2011
Computer Science & Engineering Indian Institute of Technology, Bombay

21.6 Requirements of Kernel


1. Since

       Kij = φT(xi) φ(xj) = φT(xj) φ(xi) = Kji,

   K should be a symmetric matrix.


2. By the Cauchy-Schwarz inequality,

       ( φT(x) φ(x̄) )^2 ≤ ||φ(x)||^2 ||φ(x̄)||^2
       ⇒  Kij^2 ≤ Kii Kjj

3. Positivity of the diagonal. Write the eigendecomposition

       K = V Λ V^T,

   where V is the eigenvector matrix (an orthogonal matrix) and Λ is the diagonal matrix of eigenvalues. The goal is to construct a φ that realizes K. Taking φ(xi) = Λ^{1/2} vi, where vi is the i-th row of V, gives

       φT(xi) φ(xj) = viT Λ vj = Kij,

   which is possible only if every eigenvalue satisfies λk ≥ 0. In particular, Kii = Σ_k λk Vik^2 ≥ 0.

Hence K must be
1. Symmetric.
2. Positive Semi Definite.
3. Having non-negative Diagonal Entries.
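These three necessary conditions are straightforward to test numerically for a candidate Gram matrix. A minimal numpy sketch (the function name and tolerance are our own choices):

    import numpy as np

    def is_valid_kernel_matrix(K, tol=1e-8):
        # Checks symmetry, positive semi-definiteness, and a non-negative
        # diagonal (the last is in fact implied by the first two).
        K = np.asarray(K, dtype=float)
        symmetric = np.allclose(K, K.T, atol=tol)
        psd = np.all(np.linalg.eigvalsh((K + K.T) / 2.0) >= -tol)
        nonneg_diag = np.all(np.diag(K) >= -tol)
        return symmetric and psd and nonneg_diag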


Examples of Kernels
1. Kij = (xiT xj)^d

2. Kij = (xiT xj + 1)^d

3. Gaussian or Radial Basis Function (RBF):
   Kij = exp( − ||xi − xj||^2 / (2σ^2) )      (σ ∈ R, σ ≠ 0)

4. The hyperbolic tangent function:
   Kij = tanh(σ xiT xj + c)
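For concreteness, the four kernels above written as plain functions. This is a numpy sketch with our own names and argument conventions; the Gaussian kernel is written with the squared norm, as in the standard definition.

    import numpy as np

    def poly_kernel(x, z, d):            # 1. K_ij = (x_i^T x_j)^d
        return (x @ z) ** d

    def poly_affine_kernel(x, z, d):     # 2. K_ij = (x_i^T x_j + 1)^d
        return (x @ z + 1.0) ** d

    def rbf_kernel(x, z, sigma):         # 3. K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def tanh_kernel(x, z, sigma, c):     # 4. K_ij = tanh(sigma * x_i^T x_j + c)
        return np.tanh(sigma * (x @ z) + c)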

Properties of Kernel Functions


If K′ and K′′ are kernels, then K is also a kernel if any of the following holds:
1. Kij = K′ij + K′′ij
2. Kij = αK′ij (α ≥ 0)
3. Kij = K′ij K′′ij
Proof : (1) and (2) are left as an exercise.
(3)

    Kij = K′ij K′′ij
        = ( φ′T(xi) φ′(xj) ) ( φ′′T(xi) φ′′(xj) )

Define φ(xi) = φ′(xi) ⊗ φ′′(xi), the tensor product of the two feature vectors, whose components are all products φ′k(xi) φ′′l(xi). Then φT(xi) φ(xj) = ( φ′T(xi) φ′(xj) ) ( φ′′T(xi) φ′′(xj) ) = Kij.
Hence, K is a valid kernel.
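The closure under elementwise products can also be checked numerically. The sketch below (arbitrary random data, our own variable names) forms two Gram matrices on the same points, multiplies them entrywise, and confirms that the result is still positive semi-definite:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                 # 20 arbitrary points in R^3

    K1 = (X @ X.T) ** 2                          # polynomial kernel, d = 2
    sq = np.sum(X ** 2, axis=1)
    K2 = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2.0)   # RBF, sigma = 1

    K = K1 * K2                                  # elementwise (Schur) product, property 3
    min_eig = np.linalg.eigvalsh((K + K.T) / 2.0).min()
    print(min_eig >= -1e-8)                      # True: the product is still PSD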

21.7 Algorithms for solving the dual


Duality offers multiple alternative checkpoints to see whether a solution is optimal:
1. The KKT conditions are satisfied ∀i.
2. The primal objective ≈ the dual objective (zero duality gap).
We prefer solving the dual since we can work with the kernel and avoid computing a complex φ explicitly.
(For the linear kernel K(x, x̄) = xT x̄, i.e. φ(x) = x, the map φ is simple and the problem could equally well be solved in primal form.)

Sequential Minimal Optimization Algorithm (SMO)


It turns out that in most solutions, most αi = 0, so general LCQP (linearly constrained quadratic programming) solvers are overkill. To exploit this, we use batch coordinate-wise ascent. One of the best performers is the Sequential Minimal Optimization (SMO) algorithm, which optimizes 2 α's at a time. The steps of the algorithm are:

1. Start with all αi = 0

2. Select any 2 α's, say α1 and α2, that violate the KKT conditions


3. Solve for α1 and α2

       min_{α1, α2}   − α1 − α2 − Σ_{i≠1,2} αi
                      + (1/2) ( α1^2 K11 + α2^2 K22 + 2 α1 α2 y1 y2 K12 )
                      + α1 y1 Σ_{i≠1,2} K1i αi yi  +  α2 y2 Σ_{i≠1,2} K2i αi yi           (77)

       s.t.  α1 y1 + α2 y2 = − Σ_{j≠1,2} αj yj = α1old y1 + α2old y2

             α1, α2 ∈ [0, c]

4. From the equality constraint above, we can write α1 in terms of α2:

       α1 = − (y2/y1) α2 + α1old + (y2/y1) α2old

   The objective then becomes a function of α2 alone; call it −D(α2). The program reduces to

       min_{α2}  − D(α2)
       s.t.  α2 ∈ [0, c]

   Find α2∗ such that ∂D(α2)/∂α2 = 0. We also have to ensure that α1 ∈ [0, c], so we may have to clip α2, i.e. shift it into a certain interval. The condition is

       0 ≤ − (y2/y1) α2 + α1old + (y2/y1) α2old ≤ c

5. • Case 1: y1 = y2

       α2 ∈ [ max(0, α1old + α2old − c),  min(c, α1old + α2old) ]

   • Case 2: y1 = −y2

       α2 ∈ [ max(0, α2old − α1old),  min(c, c − α1old + α2old) ]

   If α2 is already in this interval, there is no problem. If it exceeds the upper limit, reset it to the upper limit; similarly, if it falls below the lower limit, reset it to the lower limit. This clipping yields the optimum of the objective subject to the box constraint.
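A minimal sketch of this clipping step, following the two cases above (helper names are our own):

    def alpha2_bounds(y1, y2, a1_old, a2_old, c):
        # Interval [L, H] that alpha_2 must lie in so that both alpha_1 and
        # alpha_2 stay in [0, c] under the equality constraint.
        if y1 == y2:                              # case 1: same labels
            return max(0.0, a1_old + a2_old - c), min(c, a1_old + a2_old)
        else:                                     # case 2: opposite labels
            return max(0.0, a2_old - a1_old), min(c, c - a1_old + a2_old)

    def clip(a2_new, L, H):
        # Reset the unconstrained optimizer of -D(alpha_2) to the feasible interval.
        return max(L, min(H, a2_new))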

Chunking and Decomposition Methods


We are interested in solving the dual of the objective because, as we have already seen, most of the dual variables will be zero in the solution, so the dual yields a sparse solution (based on the KKT conditions).

    Dual:   min_{α}   − Σ_i αi + (1/2) Σ_i Σ_j αi αj yi yj Kij                  (78)

            s.t.  Σ_i αi yi = 0

                  αi ∈ [0, c]

The above program is a quadratic program. Any quadratic solver can be used for (78), but a generic solver does not take the special structure of the solution into account and may not be efficient. One way to solve (78) is to use projection methods (also called the kernel adatron). Two other ways of solving it are chunking methods and decomposition methods.
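Before turning to those, it may help to see how (78) maps onto a generic quadratic-programming solver. The following is only a minimal sketch, assuming the cvxopt package is available and K is a dense numpy kernel matrix; the function name is our own, and, as noted above, a generic solve like this ignores the sparsity of the solution.

    import numpy as np
    from cvxopt import matrix, solvers   # assumes cvxopt is installed

    def solve_svm_dual_qp(K, y, c):
        # Dual (78) as a standard QP: minimize 1/2 a^T Q a - 1^T a with
        # Q_ij = y_i y_j K_ij, subject to 0 <= a_i <= c and y^T a = 0.
        m = len(y)
        y = np.asarray(y, dtype=float)
        Q = matrix(np.outer(y, y) * K)
        q = matrix(-np.ones(m))
        G = matrix(np.vstack([-np.eye(m), np.eye(m)]))      # encodes -a <= 0 and a <= c
        h = matrix(np.hstack([np.zeros(m), c * np.ones(m)]))
        A = matrix(y.reshape(1, -1))
        b = matrix(0.0)
        sol = solvers.qp(Q, q, G, h, A, b)
        return np.ravel(np.array(sol['x']))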
The chunking method is as follows

1. Initialize the αi's arbitrarily.

2. Choose the points (that is, the components αi) that violate the KKT conditions (one way to check this is sketched after the decomposition paragraph below).

3. Consider only the working set (WS) and solve the dual for the variables in the working set:

       min_{αi : i ∈ WS}   − Σ_{i ∈ WS} αi + (1/2) Σ_{i ∈ WS} Σ_{j ∈ WS} αi αj yi yj Kij        (79)

       s.t.  Σ_{i ∈ WS} αi yi = − Σ_{j ∉ WS} αj yj

             αi ∈ [0, c]   ∀ i ∈ WS

4. Set αnew = [αWS_new, αnonWS_old], i.e. update the working-set components and keep the remaining components unchanged.

Decomposition methods follow almost the same procedure, except that in step 2 we always take a fixed number of points, namely those that violate the KKT conditions the most.
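A minimal sketch of the KKT-violation check used in step 2 of the chunking method (and in the selection step of decomposition methods). The function name, tolerance, and the assumption that the current bias w0 is available are our own choices; the three clauses mirror the case analysis of Section 21.3.

    import numpy as np

    def kkt_violators(alpha, y, K, w0, c, tol=1e-3):
        # g(x_i) = sum_j alpha_j y_j K_ij + w0 for every training point.
        g = (alpha * y) @ K + w0
        yg = y * g
        viol = ((alpha < tol) & (yg < 1 - tol)) | \
               ((alpha > c - tol) & (yg > 1 + tol)) | \
               ((alpha > tol) & (alpha < c - tol) & (np.abs(yg - 1) > tol))
        return np.where(viol)[0]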

Further Reading
For SVMs in general and kernel methods in particular, read the book An Introduction to Support Vector Machines and Other Kernel-based Learning Methods by Nello Cristianini and John Shawe-Taylor, uploaded on Moodle.
CS 725 : Foundations of Machine Learning Autumn 2011

Lecture 29: Support Vector Regression, Attribute Selection


Instructor: Ganesh Ramakrishnan Date: 11/11/2011
Computer Science & Engineering Indian Institute of Technology, Bombay

22 Support Vector Regression


Please refer to previous years' notes (http://www.cse.iitb.ac.in/~cs725/notes/classNotes/lecturenote_2010.pdf), Section 22.2, for this topic.

23 Attribute Selection and Transformation


Please refer to the following material for this topic:

1. Chapter 7 of the book Data Mining by I. H. Witten and E. Frank

2. Slides at http://www.cse.iitb.ac.in/~cs725/notes/classNotes/dataprocessing.pdf
