Machine Learning 3
Annalisa Marsico
OWL RNA Bioinformatics group
Max Planck Institute for Molecular Genetics
Free University of Berlin
29 April, SoSe 2015
Support Vector Machines (SVMs)
where $w^* = \arg\min_w \sum_i (y_i - w^T x_i)^2 + \lambda \sum_j w_j^2$
Vectors, data points, inner products
Consider $f(x) = \sum_j w_j x_j = \langle w, x \rangle = w^T x$
where $w = (3, 1)$ and $x = (1, 2)$, so $\langle w, x \rangle = 3 \cdot 1 + 1 \cdot 2 = 5$
[Figure: vectors w and x plotted in the (x1, x2) plane, with angle θ between them]
For any two vectors, their dot product (aka inner product) equals the product of their lengths times the cosine of the angle between them: $\langle w, x \rangle = \|w\|\,\|x\|\cos\theta$
where $w^* = \arg\min_w \sum_i (y_i - w^T x_i)^2 + \lambda \|w\|^2$
(the second term is the regularization term)
Solve by taking the derivative with respect to w and setting it to zero:
$X^T X w + \lambda w = X^T y$
so: $w = (X^T X + \lambda I)^{-1} X^T y$
Linear Regression Primal Form
Learn $f(x) = \sum_j w_j x_j = \langle w, x \rangle = w^T x$
where $w^* = \arg\min_w \sum_i (y_i - w^T x_i)^2 + \lambda \|w\|^2$
Primal solution: $w = (X^T X + \lambda I)^{-1} X^T y$
Dual solution: $\alpha = (X X^T + \lambda I)^{-1} y$, with $w = X^T \alpha = \sum_i \alpha_i x_i$
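As a sanity check, the two closed-form solutions can be compared numerically. Below is a minimal NumPy sketch on made-up toy data (the data, λ = 0.5, and all variable names are illustrative assumptions, not part of the slides): the primal form inverts a d × d matrix, the dual form an n × n matrix, and both yield the same weight vector.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                              # 50 examples, 5 features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=50)
lam = 0.5                                                 # regularization strength lambda

# Primal solution: w = (X^T X + lambda I)^-1 X^T y   (solve a d x d system)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Dual solution: alpha = (X X^T + lambda I)^-1 y, w = X^T alpha   (solve an n x n system)
alpha = np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))                      # True: both give the same weights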
$K_{ij} = \langle x_i, x_j \rangle$
$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$
$\Phi: X \to F$, $\Phi(x) \in F$
Kernel Functions
[Figure: the feature map $\Phi$ takes points x1, x2 from the original 2D space to $\Phi(x_1)$, $\Phi(x_2)$ in a higher-dimensional projected space with axes u1, u2]
Dual solution: $\alpha = (K + \lambda I)^{-1} y$, where $K_{ij} = k(x_i, x_j)$
We can compute the kernel directly from the original vectors, e.g.
$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle = \langle x_i, x_j \rangle^2$
and use it to train and apply our regression function, never leaving 2D space:
$f(x) = \sum_i \alpha_i k(x_i, x)$
Implications of the “kernel trick”
• We implicitly learn in the (much higher-dimensional) projected space, but actually use less computation for the learning phase than we did in the original space – e.g. with 1000 training examples and 1024 projected features, the dual form inverts a 1000 x 1000 matrix instead of a 1024 x 1024 matrix
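To illustrate the trick, the sketch below (toy data and the degree-2 feature map are illustrative assumptions) fits kernel ridge regression with $k(x, z) = \langle x, z \rangle^2$, so it only ever builds the n x n kernel matrix, and checks that its prediction matches the one obtained with the explicit 3-dimensional feature map $\Phi$.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))                              # training data stays in 2D
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=30)
lam = 0.1

# Explicit degree-2 feature map Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
def phi(X):
    return np.column_stack([X[:, 0] ** 2,
                            np.sqrt(2) * X[:, 0] * X[:, 1],
                            X[:, 1] ** 2])

# Kernel trick: k(x, z) = <x, z>^2 equals <Phi(x), Phi(z)> without ever building Phi
K = (X @ X.T) ** 2
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

x_new = np.array([[0.3, -0.7]])
f_kernel = ((X @ x_new.T) ** 2).ravel() @ alpha           # sum_i alpha_i k(x_i, x_new)
f_explicit = phi(x_new) @ (phi(X).T @ alpha)              # same prediction via w = Phi(X)^T alpha
print(np.allclose(f_kernel, f_explicit))                  # True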
Some common kernels
Polynomial of degree d:
$k(x_i, x_j) = \langle x_i, x_j \rangle^d$
Polynomial of degree up to d:
$k(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^d$
Gaussian / radial (RBF) kernel (polynomials of all orders – the projected space has infinitely many dimensions):
$k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
Linear kernel:
$k(x_i, x_j) = \langle x_i, x_j \rangle$
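For reference, here is one possible NumPy implementation of these kernels for a pair of vectors (the function names and the default d and σ are arbitrary choices, not from the slides).

import numpy as np

def linear_kernel(xi, xj):
    """k(xi, xj) = <xi, xj>"""
    return xi @ xj

def poly_kernel(xi, xj, d=3):
    """Polynomial of degree d: k(xi, xj) = <xi, xj>^d"""
    return (xi @ xj) ** d

def poly_kernel_up_to_d(xi, xj, d=3):
    """Polynomial of degree up to d: k(xi, xj) = (<xi, xj> + 1)^d"""
    return (xi @ xj + 1) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian / radial kernel: k(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), poly_kernel_up_to_d(x, z, d=2), rbf_kernel(x, z, sigma=0.5))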
Key points about kernels
• Many learning tasks are framed as optimization problems
[Figure: linearly separable + and − training points in 2D, with several candidate separating lines]
Pick the one with the largest margin!
Parametrizing the decision boundary
The decision boundary is the hyperplane $w^T x + b = 0$: points with $w^T x + b > 0$ are classified +, points with $w^T x + b < 0$ are classified −
[Figure: + and − points on either side of the hyperplane $w^T x + b = 0$]
Margin $\gamma$ = distance of the closest examples from the decision line / hyperplane
Margin $= \gamma = a / \|w\|$
[Figure: separating hyperplane with the margin γ marked on both sides]
Labels $y_j \in \{-1, +1\}$ for the + and − classes
Maximizing the margin corresponds to minimizing $\|w\|$!
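A small numerical example may help: for a hypothetical hyperplane with w = (3, 4) and b = −2 (values chosen only for illustration), the sketch below computes $y_j (w^T x_j + b) / \|w\|$ for a few labelled points; the smallest of these distances is the margin $a / \|w\|$.

import numpy as np

# Hypothetical hyperplane w^T x + b = 0; values chosen only for illustration
w = np.array([3.0, 4.0])                     # ||w|| = 5
b = -2.0
X = np.array([[2.0, 1.0],                    # a few labelled points
              [0.0, 0.0],
              [1.0, 2.0]])
y = np.array([+1, -1, +1])

scores = X @ w + b                           # y_j * score > 0 if correctly classified
distances = y * scores / np.linalg.norm(w)   # distance of each point from the hyperplane
print(distances)                             # [1.6, 0.4, 1.8]
print(distances.min())                       # the margin a / ||w|| = 0.4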
SVM: Maximize the margin
Margin $= \gamma = a / \|w\|$
$\max_{w,b}\ \gamma = a / \|w\|$
s.t. $y_j (w^T x_j + b) \ge a$ for all j training examples
Note: 'a' is arbitrary (we can normalize the equations by a)
[Figure: + and − points with the maximum-margin hyperplane and margin γ on each side]
Setting a = 1 gives the primal form:
$\min_{w,b}\ w^T w$
s.t. $y_j (w^T x_j + b) \ge 1$
Solve efficiently by quadratic programming (QP) – a well-studied class of problems with standard solution algorithms
Primal: $\min_{w,b}\ w^T w$   s.t. $y_j (w^T x_j + b) \ge 1$ for all j training examples
Dual: $\max_{\alpha_1,\ldots,\alpha_n}\ \sum_j \alpha_j - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k\, y_j y_k \langle x_j, x_k \rangle$   s.t. $\alpha_j \ge 0$, $\sum_j \alpha_j y_j = 0$
[Figure: separating hyperplane with $w^T x + b > 0$ on one side and $w^T x + b < 0$ on the other; the closest + and − points lie on the margin]
The linear hyperplane is defined by the support vectors:
• Moving the other points a little doesn't change the decision boundary
• We only need to store the support vectors to predict labels of new points
“Hard margin” Support Vector Machine
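In practice a hard-margin SVM can be approximated by a soft-margin solver with a very large C. The sketch below uses scikit-learn's SVC on made-up separable toy data (the data and C = 1e6 are illustrative assumptions) and prints w, b, the support vectors, and the margin width.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [2.5, 1.5],          # + class
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])   # - class
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)     # hard margin in the limit C -> infinity
clf.fit(X, y)

print(clf.coef_, clf.intercept_)      # w and b of the separating hyperplane
print(clf.support_vectors_)           # only these points define the decision boundary
print(2 / np.linalg.norm(clf.coef_))  # width of the margin band, 2 / ||w||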
Kernel SVMs
Because the dual form only depends on dot products, we can apply the kernel trick to work in a (virtual) projected space $\Phi: X \to F$
$\min_{w,b}\ w^T w$
s.t. $y_j (w^T \Phi(x_j) + b) \ge 1$ for all j training examples
Dual: $\max_{\alpha_1,\ldots,\alpha_n}\ \sum_j \alpha_j - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, k(x_j, x_k)$
Classify new points with $f(x) = w^T \Phi(x) + b = \sum_j \alpha_j y_j\, k(x_j, x) + b$
[Figure: RBF-kernel SVM decision function shown as contour lines in the original 2D (x1, x2) space]
Points are plotted in the original 2D space; circled points are the support vectors, i.e. training examples with non-zero $\alpha_j$
Contour lines correspond to $f(x) = b + \sum_{j \in SV} \alpha_j y_j\, k(x_j, x) = b + \sum_{j \in SV} \alpha_j y_j \exp\!\left(-\frac{\|x_j - x\|^2}{2\sigma^2}\right)$
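This decision function can be reproduced from a fitted model. The sketch below (toy data with circular labels, σ = 0.7 and C = 10 are illustrative assumptions) trains an RBF SVC with scikit-learn, whose dual_coef_ stores the products $\alpha_j y_j$ for the support vectors, and rebuilds f(x) by hand.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)    # non-linear (circular) labels

sigma = 0.7
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=10.0).fit(X, y)

# Reconstruct f(x) = b + sum_{j in SV} alpha_j y_j exp(-||x_j - x||^2 / (2 sigma^2))
x_new = np.array([0.3, 0.4])
sq_dists = np.sum((clf.support_vectors_ - x_new) ** 2, axis=1)
f_manual = clf.intercept_[0] + clf.dual_coef_[0] @ np.exp(-sq_dists / (2 * sigma ** 2))

print(np.isclose(f_manual, clf.decision_function(x_new.reshape(1, -1))[0]))   # True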
SVMs with Soft Margin
Allow errors in classification:
$\min_{w,b}\ w^T w + C \cdot (\#\text{mistakes})$
s.t. $y_j (w^T \Phi(x_j) + b) \ge 1$ for all j training examples
[Figure: overlapping + and − points that cannot be separated without errors]
Maximize the margin and minimize the number of mistakes on the training data
C – tradeoff parameter
Not a QP
Treats all errors equally
What if the data are not linearly separable?
Allow errors in classification:
$\min_{w,b}\ w^T w + C \sum_j \xi_j$
s.t. $y_j (w^T \Phi(x_j) + b) \ge 1 - \xi_j$, $\xi_j \ge 0$ for all j training examples
[Figure: overlapping + and − points; misclassified points are assigned a slack $\xi_j$]
$\xi_j$ = 'slack' variable ($\xi_j > 1$ if the point is misclassified)
Pay a linear penalty for mistakes
C – tradeoff parameter
Still a QP
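The effect of the tradeoff parameter C can be seen on overlapping classes. The sketch below (the toy Gaussian blobs and the chosen C values are illustrative assumptions) fits a linear soft-margin SVC for several C and reports the number of support vectors and training errors.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=+1.0, scale=1.2, size=(50, 2)),
               rng.normal(loc=-1.0, scale=1.2, size=(50, 2))])
y = np.array([+1] * 50 + [-1] * 50)       # overlapping classes: not linearly separable

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_errors = np.sum(clf.predict(X) != y)
    # Small C tolerates more margin violations; large C penalizes training errors more heavily
    print(f"C={C:7.2f}  support vectors={len(clf.support_):3d}  training errors={n_errors}")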
Variable selection with SVMs
Forward selection: all features are tried separately and the best-performing one, $x^*_1$, is retained. Then all remaining features are added in turn and the best pair $\{x^*_1, x^*_2\}$ is retained; then all remaining features are added in turn and the best trio $\{x^*_1, x^*_2, x^*_3\}$ is retained, and so on, until the performance stops increasing or until all features have been exhausted.
Backward elimination (pseudocode):
While F ≠ {}:
  - train an SVM on D (training set), using cross-validation to tune the parameters
  - p is the performance obtained with the best set of parameters
  - if p − p* > t:
      - p* = p, oldF = F
      - for each feature f in F: compute the difference in performance when f is removed
      - discard the feature that leads to the smallest difference
  - else: F = oldF; stop
Output the features that are left in F
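One possible Python rendering of this procedure, using scikit-learn's SVC and cross_val_score (the fixed C = 1.0, cv = 5, and the threshold t are simplifying assumptions; the per-iteration hyperparameter tuning from the pseudocode is omitted for brevity):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def backward_elimination(X, y, t=0.0):
    """Greedy backward elimination with a linear SVM, following the pseudocode above."""
    clf = SVC(kernel="linear", C=1.0)
    F = list(range(X.shape[1]))               # current feature set (column indices)
    p_star, old_F = -np.inf, list(F)
    while F:
        p = cross_val_score(clf, X[:, F], y, cv=5).mean()   # performance with current F
        if p - p_star <= t:                   # no longer improving by more than t
            F = old_F
            break
        p_star, old_F = p, list(F)
        if len(F) == 1:                       # nothing left to remove
            break
        # performance after removing each feature f in turn; drop the one whose
        # removal leads to the smallest difference (i.e. hurts accuracy the least)
        perf_without = {f: cross_val_score(clf, X[:, [g for g in F if g != f]], y, cv=5).mean()
                        for f in F}
        F.remove(max(perf_without, key=perf_without.get))
    return F                                  # the features that are left in F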
SVM Summary
• Objective: maximize the margin between the decision surface and the data
• Kernel SVMs: learn a linear decision boundary in a high-dimensional space while working in the original low-dimensional space
• Application example: protein localization