
Applied Machine Learning

Annalisa Marsico
OWL RNA Bioinformatics group
Max Planck Institute for Molecular Genetics
Free University of Berlin
29 April, SoSe 2015
Support Vector Machines (SVMs)

1. One of the most widely used and most successful approaches to training a classifier
2. Based on the new idea of maximizing the margin as the objective function
3. Based on the idea of kernel functions
Kernel Regression
Linear Regression
We wish to learn f: X -> Y, where X = <X1, …, Xp>, Y is real-valued, and
p = number of features

Learn f(x) = Σ_i w_i x_i = <w, x> = wᵀx

where w = argmin_w Σ_j (y_j − wᵀx_j)² + λ Σ_i w_i²
Vectors, data points, inner products
Consider f(x) = Σ_i w_i x_i = <w, x> = wᵀx
where w = [3 1] and x = [1 2]

[Figure: the vectors w and x drawn in the (x1, x2) plane, with angle θ between them]

For any two vectors, their dot product (aka inner product) is equal to the
product of their lengths times the cosine of the angle between them:

<w, x> = Σ_i w_i x_i = ‖w‖ ‖x‖ cos θ
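As a quick numerical check of this identity for w = [3 1] and x = [1 2] (a minimal sketch in NumPy; all names are illustrative):

import numpy as np

w = np.array([3.0, 1.0])
x = np.array([1.0, 2.0])

# Dot product computed directly as sum_i w_i * x_i
dot_direct = float(np.dot(w, x))                      # 3*1 + 1*2 = 5

# Same quantity via |w| |x| cos(theta), with theta taken as the
# difference of the two vectors' angles in the plane
theta = np.arctan2(x[1], x[0]) - np.arctan2(w[1], w[0])
dot_geometric = np.linalg.norm(w) * np.linalg.norm(x) * np.cos(theta)

print(dot_direct, dot_geometric)                      # both are 5.0 (up to rounding)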


Linear Regression Primal Form
Learn f(x) = Σ_i w_i x_i = <w, x> = wᵀx

where w = argmin_w Σ_j (y_j − wᵀx_j)² + λ Σ_i w_i²
                                        (regularization term)

Solve by taking the derivative with respect to w and setting it to zero:

w = (XᵀX + λI)⁻¹ XᵀY

So: f(x) = wᵀx = xᵀ(XᵀX + λI)⁻¹ XᵀY

where X is the matrix whose rows are the training examples x_j and Y is the vector of targets y_j
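A minimal sketch of this closed-form primal solution in NumPy (assuming a data matrix X with one training example per row and a target vector Y; the function name and the random data are illustrative):

import numpy as np

def ridge_primal(X, Y, lam=1.0):
    # Closed-form primal solution: w = (X^T X + lam*I)^(-1) X^T Y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # 50 examples, 3 features
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w = ridge_primal(X, Y, lam=0.1)
f_x0 = X[0] @ w                                   # prediction f(x) = w^T x for one example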
Linear Regression Primal Form
Learn f(x) = Σ_i w_i x_i = <w, x> = wᵀx

where w = argmin_w Σ_j (y_j − wᵀx_j)² + λ Σ_i w_i²

Solution: w = (XᵀX + λI)⁻¹ XᵀY

Interesting observation: w lies in the space spanned by the
training examples (why?)
Linear Regression Dual Form
Learn f(x) = Σ_i w_i x_i = <w, x> = wᵀx
where w = argmin_w Σ_j (y_j − wᵀx_j)² + λ Σ_i w_i²

Solution: w = (XᵀX + λI)⁻¹ XᵀY

The dual form uses the fact that: w = Σ_j α_j x_j

Learn f(x) = Σ_j α_j <x_j, x>

Solution: α = (G + λI)⁻¹ Y

A lot of dot products…


Key ingredients of Dual Solution
Step 1: Compute

α = (G + λI)⁻¹ Y

where G is the Gram matrix, that is G_jk = <x_j, x_k>

Step 2: Evaluate f on a new point x by

f(x) = Σ_j α_j <x_j, x>

Important observation: both steps only involve inner products
between input data points
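A minimal sketch of these two steps in NumPy, here with the plain linear kernel G_jk = <x_j, x_k> (function names and data are illustrative; the dual predictions match the primal ones above):

import numpy as np

def fit_dual(X, Y, lam=1.0):
    # Step 1: alpha = (G + lam*I)^(-1) Y, with Gram matrix G_jk = <x_j, x_k>
    G = X @ X.T                                   # all pairwise inner products
    return np.linalg.solve(G + lam * np.eye(len(X)), Y)

def predict_dual(X_train, alpha, x_new):
    # Step 2: f(x) = sum_j alpha_j <x_j, x>
    return alpha @ (X_train @ x_new)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5])

alpha = fit_dual(X, Y, lam=0.1)
print(predict_dual(X, alpha, X[0]), Y[0])         # dual prediction vs. target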
Kernel Functions
Since the computation only involves dot products, we can substitute
a kernel function k for all occurrences of <. , .>, where k computes:

k(x, z) = <Φ(x), Φ(z)>

Φ is a function from the original space to a feature space F
(a higher-dimensional space) defined by the mapping:

Φ : x ↦ Φ(x) ∈ F
Kernel Functions
[Figure: points x1 and x2 in the original space are mapped by Φ to Φ(x1) and Φ(x2) in the projected, higher-dimensional space with axes u1, u2]

What the kernel function k does is give us an operation in the original space
that is equivalent to computing dot products in the higher-dimensional space:

k(x, z) = <Φ(x), Φ(z)>
Linear Regression Dual Form
Learn f(x) = Σ_i w_i x_i = <w, x> = wᵀx
where w = argmin_w Σ_j (y_j − wᵀx_j)² + λ Σ_i w_i²

Solution: w = (XᵀX + λI)⁻¹ XᵀY

The dual form uses the fact that: w = Σ_j α_j x_j

Learn f(x) = Σ_j α_j <x_j, x> = Σ_j α_j k(x_j, x)

Solution: α = (G + λI)⁻¹ Y, with G_jk = k(x_j, x_k)

By doing this we gain computational efficiency!


Example: Quadratic kernel
Suppose we have data originally in 2D, but we project it into 3D using Φ(x):

Φ(x) = Φ([x1, x2]) = [x1², √2·x1·x2, x2²]

This converts our linear regression problem into quadratic regression!

But we can use the following kernel function to calculate dot products in
the projected 3D space, in terms of operations in the 2D space:

<Φ(x), Φ(z)> = <x, z>² ≝ k(x, z)

And use it to train and apply our regression function, never leaving the 2D space:

f(x) = Σ_j α_j k(x_j, x),   α = (K + λI)⁻¹ Y,   K_jk = k(x_j, x_k)
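A quick numerical check of this identity (a minimal sketch; the explicit 3D map Φ and the kernel k are the ones defined on this slide):

import numpy as np

def phi(x):
    # Explicit quadratic feature map: 2D -> 3D
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    # Quadratic kernel: k(x, z) = <x, z>^2, computed entirely in 2D
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))     # dot product in the projected 3D space
print(k(x, z))                    # same value without leaving 2D: (3 - 2)^2 = 1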
Implications of the “kernel trick”

• Consider, for example, computing a regression function over 1000 images
  represented by pixel vectors of 32 x 32 = 1024 pixels

• By using the quadratic kernel we implicitly implement the regression function
  in a roughly 1,000,000-dimensional space

• But we actually use less computation for the learning phase than
  we did in the original space: we invert a 1000 x 1000 Gram matrix instead
  of a 1024 x 1024 matrix
Some common kernels
Polynomial of degree d
k(x, z) = <x, z>^d

Polynomial of degree up to d
k(x, z) = (<x, z> + 1)^d

Gaussian / radial kernels (polynomials of all orders – the projected
space has infinite dimensions)
k(x, z) = exp(−‖x − z‖² / (2σ²))

Linear kernel
k(x, z) = <x, z>
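Written out explicitly, these kernels are only a few lines each (a minimal sketch; d and sigma are the hyperparameters named above):

import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def poly_kernel(x, z, d=2):
    # Polynomial of degree d: <x, z>^d
    return np.dot(x, z) ** d

def poly_up_to_kernel(x, z, d=2):
    # Polynomial of degree up to d: (<x, z> + 1)^d
    return (np.dot(x, z) + 1) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # Gaussian / radial kernel: exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / (2 * sigma ** 2))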
Key points about kernels
• Many learning tasks are framed as optimization problems

• Primal and dual formulations of optimization problems

• Dual version framed in terms of dot products between x's

• Kernel functions k(x,z) allow calculating dot products <Ф(x), Ф(z)>
  without actually projecting x into Ф(x)

• Leads to major efficiencies, and the ability to use very high-dimensional
  (virtual) feature spaces

• We can learn non-linear functions


Kernel-Based Classifiers
Linear Classifier – Which line is better?

[Figure: two classes of points (+ and −) with several candidate separating lines]
Pick the one with the largest margin!

[Figure: the same + and − points with the maximum-margin separating line]
Parametrizing the decision boundary

[Figure: + and − points separated by the hyperplane wᵀx + b = 0, with wᵀx + b > 0 on the + side and wᵀx + b < 0 on the − side]

Labels y_j ∈ {−1, +1} indicate the class of each training example


Maximizing the margin

[Figure: separating hyperplane with margin γ on either side]

Margin = distance of the closest examples from the decision line / hyperplane

Margin = γ = a / ‖w‖

Labels y_j ∈ {−1, +1}


Maximizing the margin
[Figure: separating hyperplane with margin γ on either side]

Margin = distance of the closest examples from the decision line / hyperplane

Margin = γ = a / ‖w‖

Labels y_j ∈ {−1, +1}

Maximizing the margin corresponds to minimizing ‖w‖!
SVM: Maximize the margin

[Figure: maximum-margin hyperplane with margin γ on either side]

Margin = γ = a / ‖w‖

max_{w,b}  γ = a / ‖w‖
s.t. y_j (wᵀx_j + b) ≥ a for all j training examples

Note: 'a' is arbitrary (we can normalize the equations by a)

Labels y_j ∈ {−1, +1}


Support Vector Machine (primal form)
max_{w,b}  γ = 1 / ‖w‖
s.t. y_j (wᵀx_j + b) ≥ 1 for all j training examples

[Figure: maximum-margin hyperplane with the margin constraints set to ±1]

Primal form:

min_{w,b}  wᵀw
s.t. y_j (wᵀx_j + b) ≥ 1 for all j training examples

Solved efficiently by quadratic programming (QP):
well-studied solution algorithms

This is the non-kernelized version of SVMs!
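This QP is rarely solved by hand-written code; as a sketch, scikit-learn's SVC with a linear kernel and a very large C behaves approximately like the hard-margin primal problem above (the blob data below is synthetic and illustrative):

import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters of points (synthetic data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

# Very large C ~ hard margin: margin violations are heavily penalized
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.coef_, clf.intercept_)       # the learned w and b
print(clf.support_vectors_)            # the examples that define the margin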


SVMs (from primal form to dual form)

• With kernel regression we had to go from the primal form of our
  optimization problem to its dual version
  -> expressed in a way in which we only need to compute dot products

• We do the same for SVMs

• Everything that applies to kernel regression applies to SVMs

• But with a different objective function: the margin
SVMs (from primal form to dual form)
Primal form: solve for w, b

min_{w,b}  wᵀw
s.t. y_j (wᵀx_j + b) ≥ 1 for all j training examples

Classification test for a new x:  wᵀx + b > 0

Dual form: solve for α1, …, αN

max_{α1,…,αN}  Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k <x_j, x_k>
s.t. α_j ≥ 0 for all j training examples and Σ_j α_j y_j = 0

Classification test for a new x:  Σ_{j∈SV} α_j y_j <x_j, x> + b ≥ 0
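As an illustration, the dual classification test can be reproduced from a fitted linear SVM in scikit-learn, whose dual_coef_ attribute stores the products alpha_j * y_j for the support vectors (a minimal sketch on synthetic data):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (30, 2)), rng.normal(-2.0, 1.0, (30, 2))])
y = np.array([+1] * 30 + [-1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([0.5, 1.5])
# Dual decision value: sum over support vectors of alpha_j * y_j * <x_j, x_new> + b
dual_value = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]

print(dual_value, clf.decision_function([x_new])[0])   # the two values agree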


Support Vectors
[Figure: decision boundary with the two regions
Σ_{j∈SV} α_j y_j <x_j, x> + b > 0 (i.e. wᵀx + b > 0) and
Σ_{j∈SV} α_j y_j <x_j, x> + b < 0 (i.e. wᵀx + b < 0);
the support vectors lie on the margin]

• The linear hyperplane is defined by the support vectors
• Moving the other points a little doesn't change the decision boundary
• We only need to store the support vectors to predict the labels of new points

The "hard margin" Support Vector Machine
Kernel SVMs
Because the dual form only depends on dot products, we can apply the
kernel trick to work in a (virtual) projected space Ф : X → F

Primal form: solve for w, b in the projected higher-dimensional space

min_{w,b}  wᵀw
s.t. y_j (wᵀΦ(x_j) + b) ≥ 1 for all j training examples

Classification test for a new x:  wᵀΦ(x) + b > 0

Dual form: solve for α1, …, αN

max_{α1,…,αN}  Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k k(x_j, x_k)
s.t. α_j ≥ 0 for all j training examples and Σ_j α_j y_j = 0

Classification test for a new x:  Σ_{j∈SV} α_j y_j k(x_j, x) + b ≥ 0


SVM decision surface using a Gaussian kernel

f(x) = wᵀΦ(x) + b

[Figure: training points plotted in the original 2D (x1, x2) space; the circled points are the support vectors, i.e. the training examples with non-zero α_j]

Contour lines correspond to

f(x) = b + Σ_{j∈SV} α_j y_j k(x_j, x) = b + Σ_{j∈SV} α_j y_j exp(−‖x_j − x‖² / (2σ²))
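A decision surface of this kind can be reproduced with scikit-learn's SVC and an RBF kernel, where the gamma parameter plays the role of 1/(2σ²) in the formula above (a minimal sketch on synthetic, non-linearly-separable 2D data):

import numpy as np
from sklearn.svm import SVC

# Synthetic 2D data: a ring of negative points around a positive cluster
rng = np.random.default_rng(0)
X_in = rng.normal(0.0, 0.7, (50, 2))
angles = rng.uniform(0.0, 2 * np.pi, 50)
X_out = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(0.0, 0.3, (50, 2))
X = np.vstack([X_in, X_out])
y = np.array([+1] * 50 + [-1] * 50)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

print(len(clf.support_vectors_))       # the "circled" points: support vectors
print(clf.decision_function(X[:3]))    # values of f(x) that define the contours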
SVMs with Soft Margin
Allow errors in classification:

min_{w,b}  wᵀw + C · (# mistakes)
s.t. y_j (wᵀΦ(x_j) + b) ≥ 1 for all j training examples

[Figure: overlapping + and − points; a few examples fall on the wrong side of the margin]

Maximize the margin and minimize the number of mistakes on the training data

C – tradeoff parameter
Not QP
Treats all errors equally
What if the data are not linearly separable?

Allow errors in classification:

min_{w,b}  wᵀw + C Σ_j ξ_j
s.t. y_j (wᵀΦ(x_j) + b) ≥ 1 − ξ_j for all j training examples

[Figure: overlapping + and − points; misclassified examples are assigned slack ξ_j]

ξ_j = 'slack' variable (> 1 if x_j is misclassified)
Pay a linear penalty for mistakes

C – tradeoff parameter
Still QP ☺
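The tradeoff parameter C is exposed directly in most SVM implementations; as a sketch, scikit-learn's SVC takes it as its C argument (smaller C tolerates more margin violations, larger C penalizes them more heavily; the overlapping data below is synthetic):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.5, (50, 2)), rng.normal(-1.0, 1.5, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wide margin, many support vectors; large C: fewer training mistakes
    print(C, clf.n_support_, clf.score(X, y))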
Variable selection with SVMs
Forward selection: all features are tried separately and the one performing best
is retained. Then all remaining features are added in turn and the best pair {f1*, f2*}
is retained. Then all remaining features are added in turn and the best trio {f1*, f2*, f3*}
is retained. And so on, until the performance stops increasing or until all features
have been exhausted.

Pseudocode: F = full set of features, S = selected features = {}, p = current performance = 0,
oldp = previous performance = -1, p* = best performance = 0

While F ≠ {} and while p > oldp:

- for each feature f in F
  - for each of the K folds  # cross-validation
    - split D into T (training) and V (validation)
    - train a model M on T using features S ∪ {f}
    - compute the performance of M on V
  - compute the average performance over the K folds
- choose the feature f* that leads to the best performance p*
- if p* > p, then: oldp = p, p = p*, S = S ∪ {f*}, F = F / {f*}; else stop
Output features in order of importance
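A runnable sketch of this greedy forward selection using a linear SVM and k-fold cross-validation (assuming a feature matrix X and a label vector y; the function name and scoring setup are illustrative):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, k=5):
    # Greedy forward feature selection with a linear SVM and k-fold CV
    remaining = list(range(X.shape[1]))       # F: full set of features
    selected = []                             # S: selected features
    p, oldp = 0.0, -1.0                       # current and previous performance

    while remaining and p > oldp:
        # Average CV performance of each candidate feature added to S
        scores = {f: cross_val_score(SVC(kernel="linear"),
                                     X[:, selected + [f]], y, cv=k).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] > p:                # keep f* only if performance improves
            oldp, p = p, scores[f_best]
            selected.append(f_best)
            remaining.remove(f_best)
        else:
            break
    return selected                           # features in order of importance

# selected = forward_selection(X, y)          # X: (n_samples, n_features), y: labels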
Variable selection with SVMs
Recursive feature elimination: at first all features are used to train an SVM and
the margin γ is computed. Then, for each feature f, a new margin γ' is computed using the
feature set F' = F / {f}. The feature f leading to the smallest
difference between γ and γ' is considered least valuable and is discarded. The process is
repeated until the performance starts degrading.

Pseudocode: F = full set of features, S = selected features = {}, p = current performance = 0,
p* = best performance = 0; t = threshold on p

While F ≠ {}:
- train an SVM on D (training set) using cross-validation to tune parameters
- p is the performance obtained with the best set of parameters
- if p - p* > t:
  - p* = p, oldF = F
  - for each feature f in F:
    - compute the difference in performance when f is removed
  - discard the feature that leads to the smallest difference
- else: F = oldF and stop
Output the features that are left in F
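For comparison, scikit-learn ships a weight-based variant of this idea as sklearn.feature_selection.RFE, which repeatedly drops the feature with the smallest linear-SVM coefficient magnitude rather than the smallest margin difference (a minimal sketch; X and y are assumed to be a feature matrix and label vector):

from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# Linear SVM as the base estimator; at each step RFE discards the feature
# with the smallest |w_i|, a common stand-in for the margin-based criterion
estimator = SVC(kernel="linear", C=1.0)
selector = RFE(estimator, n_features_to_select=5, step=1)

# selector = selector.fit(X, y)       # X: (n_samples, n_features), y: labels
# print(selector.support_)            # boolean mask of the retained features
# print(selector.ranking_)            # rank 1 = kept until the end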
SVM Summary
• Objective: maximize the margin between the decision surface and the data

• Primal and dual formulations:
  • the dual represents the classifier decision in terms of the support vectors

• Kernel SVMs:
  • learn a linear decision surface in a high-dimensional space while working
    in the original low-dimensional space

• Handling noisy data: soft margin 'slack variables'
  • again with primal and dual forms

• SVM algorithm: quadratic programming optimization
  • single global minimum
Applications of SVMs in Bioinformatics
• Gene function prediction (from microarray data, RNA-seq)

• Cancer tissue classification

• Remote homology detection in proteins (structure & sequence features)

• Translation initiation site recognition in DNA (from distal sequences)

• Promoter prediction (from sequence alone or other genomic features)

• Protein localization

• Virtual screening of small molecules
