Artificial Intelligence and Machine Learning: T.A. Silvia Bucci
Academic Year 2020/2021
Tatiana Tommasi
E-mail*: [email protected]
Supervised Learning with SVM
Summary:
● Differently from the perceptron: batch learning (not online), and here we include a large-margin condition, so not all linear classifiers are equally good.
● The best classifier is described in a sparse way -- only the support vectors are relevant, no need to
keep the whole set of training data (fundamental difference wrt KNN).
We can do more besides classification
● Support Vector Data Description (SVDD) and One-Class SVM (OSVM) -- two different strategies to identify outliers
Unsupervised Learning
Outlier / Novelty Detection
An inherently binary problem, but with some peculiarities.
Applications: intrusion detection, fraud detection, fault detection, robotics, medical diagnosis, e-commerce, and more...
Figure Credit: Refael Chickvashvili
SVDD [Tax and Duin, ML 2004]
● Find the minimal circumscribing sphere around the data in the
feature space
● Allow slacks (most of the points should lie inside the sphere, not all)
ν controls how many outliers are possible: the smaller ν, the fewer points will be outliers, since violating the constraint above is penalized more heavily in the optimization.
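For reference, a sketch of the standard SVDD primal from Tax and Duin (the notation on the slide may differ; φ is the feature map, N the number of training points):

\min_{R,\,c,\,\xi} \; R^2 + \frac{1}{\nu N}\sum_{i=1}^{N}\xi_i
\quad \text{s.t.} \quad \|\phi(x_i) - c\|^2 \le R^2 + \xi_i, \qquad \xi_i \ge 0 \;\; \forall i

Here R and c are the radius and center of the sphere, and the slack ξ_i lets point x_i fall outside it at a price controlled by ν.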
SVDD [Tax and Duin, ML 2004]
● Find the minimal circumscribing sphere around the data in the
feature space
● Allow slacks (most of the points should lie inside the sphere, not all)
Post-Video Note:
In the video Silvia mentions "Parzen Windows". That is a non-parametric density estimation method. We will not go over that method in the course this year. Still, if you are interested you can check the following references:
● Chapter 2.5.1 in "Pattern Recognition and Machine Learning" by Bishop.
● Chapter 4 in "Pattern Classification" by Duda, Hart, Stork.
[Figure labels: Unbounded Support Vectors, Bounded Support Vectors]
ν controls how many outliers are possible: the smaller ν, the fewer points will be outliers, since violating the constraint above is penalized more heavily in the optimization.
SVDD kernel extension
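A sketch of the key algebraic step (standard SVDD algebra, assuming the center can be written as c = Σ_i α_i φ(x_i), as the KKT conditions on the next slide guarantee): the squared distance of any point from the center needs only kernel evaluations,

\|\phi(x) - c\|^2 = k(x,x) \;-\; 2\sum_i \alpha_i\, k(x, x_i) \;+\; \sum_{i,j}\alpha_i\alpha_j\, k(x_i, x_j)

so the sphere can be found and evaluated in feature space without ever computing φ explicitly.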
SVDD Lagrangian
Karush–Kuhn–Tucker (KKT) conditions
The sphere center is a linear combination of the transformed data points.
SVDD Lagrangian
maximize over α
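As a hedged reconstruction of the missing formulas (this is the standard SVDD derivation; the slides' symbols may differ): introducing multipliers α_i ≥ 0 for the sphere constraints, the KKT stationarity conditions give

c = \sum_i \alpha_i\, \phi(x_i), \qquad \sum_i \alpha_i = 1, \qquad 0 \le \alpha_i \le \frac{1}{\nu N}

and substituting back into the Lagrangian yields the dual problem, to be maximized over α:

\max_{\alpha} \;\; \sum_i \alpha_i\, k(x_i, x_i) \;-\; \sum_{i,j} \alpha_i \alpha_j\, k(x_i, x_j)

subject to the constraints above.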
Difference wrt SVM
Both duals are maximized over α and access the data only through the kernel. [Slide annotations: "nothing here", "kernel", "maximize over α"]
Why pass through the Dual?
● Historically, the dual was introduced to use kernels, but it was later proven that this is not necessary
● Yet, the dual is a convex quadratic problem with linear constraints: easy to optimize!
● Have a look at SMO and the stochastic dual coordinate ascent algorithm if you want to know more (http://cs229.stanford.edu/notes/cs229-notes3.pdf)
What are the next steps?
● find α by solving the dual
● find c using the KKT condition: c is a linear combination of the transformed data points, weighted by α
● find R as the distance from c of any support vector lying on the sphere boundary
● prediction rule: flag a test point as an outlier if its distance from c exceeds R
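A minimal numpy sketch of these steps, assuming the dual has already been solved for alpha by some QP solver; names such as rbf_kernel, alpha, X_train are illustrative, not from the slides:

import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # k(a, b) = exp(-gamma * ||a - b||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def svdd_predict(X_train, alpha, X_test, gamma=0.5, C=0.1):
    # alpha: dual solution (sums to 1, 0 <= alpha_i <= C), assumed given
    K = rbf_kernel(X_train, X_train, gamma)
    const = alpha @ K @ alpha                      # sum_ij alpha_i alpha_j k(x_i, x_j)
    # squared distance of each test point from the center c = sum_i alpha_i phi(x_i)
    Kts = rbf_kernel(X_test, X_train, gamma)
    dist2 = np.diag(rbf_kernel(X_test, X_test, gamma)) - 2 * Kts @ alpha + const
    # R^2 from an unbounded support vector (0 < alpha_i < C), i.e. a point on the boundary
    sv = np.where((alpha > 1e-8) & (alpha < C - 1e-8))[0][0]
    R2 = K[sv, sv] - 2 * K[sv] @ alpha + const
    return dist2 > R2                              # True = outlier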
One-Class SVM [Schölkopf et al., 2001]
If all data points have the same feature space norm and can be separated
linearly from the origin, finding the minimal enclosing sphere is equivalent to
finding the maximal margin hyperplane between the data points and the
origin.
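For reference, a sketch of the ν-One-Class SVM primal from Schölkopf et al. (the slides' notation may differ): separate the mapped points from the origin with the largest possible margin ρ/||w||, allowing slacks,

\min_{w,\,\xi,\,\rho} \;\; \frac{1}{2}\|w\|^2 + \frac{1}{\nu N}\sum_{i=1}^{N}\xi_i - \rho
\quad \text{s.t.} \quad \langle w, \phi(x_i)\rangle \ge \rho - \xi_i, \qquad \xi_i \ge 0

with decision function f(x) = sign(⟨w, φ(x)⟩ − ρ): points falling on the origin side of the hyperplane are flagged as outliers.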
One-Class SVM
Same feature space norm = translation-invariant kernel = k(x,x) is a constant.
In the dual, the term involving k(x,x) is therefore constant and can be dropped: we just maximize over α.
Here the origin represents *all* the outliers (negatives) = low similarity to the training set.
Best thing: try it out
https://github.com/rmenoli/One-class-SVM/blob/master/One%20class%20SVM.ipynb
https://github.com/SatyaVSarma/anomaly-detection-osvm
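If you prefer a self-contained starting point, here is a minimal scikit-learn sketch (the synthetic data and the parameter values are illustrative only):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)                       # "normal" points only
X_test = np.vstack([0.3 * rng.randn(20, 2),             # more normal points
                    rng.uniform(-4, 4, size=(20, 2))])  # outliers

# nu upper-bounds the fraction of training points treated as outliers
clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X_train)

pred = clf.predict(X_test)                 # +1 = inlier, -1 = outlier
print("detected outliers:", int((pred == -1).sum()))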
Ranking
Ranking Task
In many cases data are not annotated with simple binary or multi-class labels.
● Pairwise approach: consider a pair of documents at a time in the loss function, try to come up with the optimal ordering for that pair and compare it to the ground truth. The goal of the ranker is to minimize the number of inversions in the ranking, i.e. cases where a pair of results is in the wrong order relative to the ground truth.
● Listwise approach: consider a list of documents and try to come up with the optimal ordering for all of them.
Pairwise Ranking SVM
Relax the constraints a bit and use a max-margin learning-to-rank formulation.
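A sketch of the standard max-margin pairwise formulation (as in Ranking SVM / Relative Attributes; the exact symbols on the slides may differ). For every ordered pair (i, j) in which x_i should be ranked above x_j, ask for a unit margin, with slacks:

\min_{w,\,\xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{(i,j)} \xi_{ij}
\quad \text{s.t.} \quad w^\top x_i - w^\top x_j \ge 1 - \xi_{ij}, \qquad \xi_{ij} \ge 0

At test time the learned w assigns each sample the real-valued score w^\top x, and sorting by this score gives the predicted ranking.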
Pairwise Ranking SVM
[Figure: training points ordered by rank 1-6]
Rank Margin: the distance between the closest ranked points is the rank margin -- what we want to maximize in this formulation.
At test time: given an image and its feature, predict its relative attribute.
Pairwise Ranking SVM
ICCV 2011 Marr Prize Paper: "Relative Attributes" by Parikh and Grauman
Project Page: https://www.cc.gatech.edu/~parikh/relative.html
Unsupervised Learning
today: clustering
...but (most probably) you already learned an unsupervised method...which one?
Principal Component Analysis - we will not go over this subspace projection strategy in this course, but recall that it is a method based only on the data, not on their labels; thus it is unsupervised. If you want to know more, check Chapter 23.1 in "Understanding Machine Learning: From Theory to Algorithms" by Shalev-Shwartz and Ben-David, or Chapter 12 in "Pattern Recognition and Machine Learning" by Bishop.
Unsupervised Learning
Slide Credit: A. Smola, B. Póczos
But similarity is a difficult concept
Slide Credit: A. Smola, B. Póczos
k-means algorithm
Assumptions
● Assume a Euclidean space / distance
● Start by picking k, the number of clusters (groups of data)
● Initialize the clusters by picking one point per cluster as its centroid
● For the moment we assume the k points are picked at random
Repeat steps 2 and 3 (assign each point to its closest centroid, then recompute each centroid as the mean of its assigned points) until convergence: all centroids stabilize and points no longer move between clusters.
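A compact sketch of the algorithm in plain numpy (function and variable names are mine, not from the slides):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.RandomState(seed)
    # step 1: initialize the centroids by picking k training points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # step 2: assign each point to its closest centroid (Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged: centroids stabilized
            break
        centroids = new_centroids
    return labels, centroids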
Example k = 2
[Figures: k-means iterations from the random starting point to convergence]
Quite a reasonable clustering. We are done!
Slide Credit: A. Smola, B. Póczos
Example k = 3
[Figures: k-means iterations from the random starting point to convergence]
Quite a reasonable clustering. We are done!
Slide Credit: A. Smola, B. Póczos
How do we choose k?
For each cluster compute D = the average of the distances of its points to its centroid (the red, blue, green, purple, orange, ... centroids in the figures).
The error cost for a given k is the average of these values: (D+D)/2 for k = 2, (D+D+D)/3 for k = 3, (D+D+D+D)/4 for k = 4, (D+D+D+D+D)/5 for k = 5, and so on.
Plot the error cost against k (k = 1, 2, 3, 4, 5, 6): it decreases quickly at first and then flattens out.
Elbow! Choose the k at the elbow of the curve, where adding more clusters stops reducing the error cost significantly.
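A possible way to reproduce the elbow plot with scikit-learn (here the error cost is the inertia, i.e. the sum of squared distances to the closest centroid; the slides use average distances, but the shape of the curve is the same):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

ks = range(1, 7)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), costs, marker="o")          # look for the elbow in this curve
plt.xlabel("k"); plt.ylabel("error cost")
plt.show()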
How do we select the k points?
A few simple choices:
1. at random
What are we really optimizing?
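A sketch of the standard answer (the distortion, or sum-of-squared-errors, objective; the slides' exact notation may differ):

J = \sum_{n=1}^{N} \sum_{j=1}^{k} r_{nj}\, \|x_n - \mu_j\|^2, \qquad r_{nj} = 1 \text{ if } x_n \text{ is assigned to cluster } j, \text{ else } 0

The assignment step minimizes J over the r_{nj} with the centroids fixed; the update step minimizes J over the centroids μ_j with the assignments fixed, giving μ_j equal to the mean of the points in cluster j. Each step can only decrease J, which is why the algorithm converges to a local minimum.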
Complexity
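As a rough guide (assuming the standard Lloyd's algorithm sketched above, with N points in d dimensions):

O(N \cdot k \cdot d) \text{ per iteration for the assignment step} \; + \; O(N \cdot d) \text{ for the centroid update}

multiplied by the number of iterations, which in practice is usually small.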
Convergence
The algorithm terminates when there are only very small oscillations around a local minimum, which most of the time corresponds to a reasonable clustering solution.
Clustering Visual Application: Segmentation
Slide Credit: David Sontag
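A minimal sketch of how such a segmentation can be obtained: cluster the pixel colors with k-means and repaint every pixel with the color of its centroid (the image path and the value of k are placeholders):

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("image.jpg"), dtype=np.float64) / 255.0   # H x W x 3
H, W, _ = img.shape
pixels = img.reshape(-1, 3)                    # one RGB sample per pixel

k = 4                                          # number of segments (placeholder)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# replace every pixel by the color of its cluster centroid
segmented = km.cluster_centers_[km.labels_].reshape(H, W, 3)
Image.fromarray((segmented * 255).astype(np.uint8)).save("segmented.jpg")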
Best thing: try it out