INF442 — Data Science Booklet
Steve Oudot
Key figures:
▶ size of the ‘global data sphere’ (including 10% of unique data):
  2 zB (2010) → 79 zB (2021) → 181 zB predicted (2025)
  global data sphere = sum of all data created, captured or replicated (1 zB = 10^21 Bytes)
— source: International Data Corporation
Data production
Data are produced at an unprecedented rate by:
▶ Industry / Economy
▶ Sciences
▶ End users
Challenges
Big data
(streamed, online, distributed)
AI for games:
ImageNet Challenge:
[J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei: ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009]
▶ 2012: breakthrough — a deep CNN (AlexNet), with more than 60 million parameters to tune, reduced the error rate to 16%
▶ by now: error rates typically below 5% on narrow one-against-all tasks (e.g. recognizing cats, cars, etc.), i.e. performance better than human on narrow tasks
[Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012]
Data science’s celebrated successes...
ImageNet Challenge:
▶ 2012: breakthrough — a deep CNN (AlexNet), with more than 60 million parameters to tune, reduced the error rate to 16%
▶ unsupervised pre-training ≃ learning feature generators (e.g. using auto-encoders) to be plugged into supervised pipelines; leads to concept learning (e.g. human face, cat face)
[Le et al.: Building high-level features using large scale unsupervised learning, ICML 2012]
▶ fine-tuning of pre-trained models by Reinforcement Learning from Human Feedback (RLHF)
Microsoft’s Tay:
Core topics:
▶ statistical analysis
▶ pattern recognition
▶ data mining
Data?
Datum ≡ observation ≡ “chunk of information”
Vector representation
(Note: here we refer to the data representations taken as input by most processing algorithms.)
• coordinate matrix: one row per observation x1, · · · , xn, one column per variable v1, · · · , vd
Metric representation
• distance / (dis-)similarity matrix: one row and one column per observation x1, · · · , xn
Data?
Datum ≡ observation ≡ “chunk of information”
Vector representation: raw data are mapped to points of Rd via feature extraction
Metric representation: pairwise distances / (dis-)similarities between observations
Programming languages for data science
note: most other modern query languages build on SQL (e.g. QBE is in fact just a front-end)
note: what is taught is: (1) the principles of each approach; (2) how to apply them in Python
Programming languages for data science
[...]
Learning paradigms
Supervised learning
Input: data with labels (examples)
Goal: predict the labels of new data
Typical problems:
▶ classification (categorical labels, e.g. cat / dog / horse images)
▶ regression (continuous labels)
▶ forecasting (regression on time series, e.g. energy consumption from weather parameters)
Learning paradigms
Unsupervised learning
Input: data without labels
Typical problems:
▶ clustering
▶ dimensionality reduction
▶ anomaly detection / noise removal
Learning paradigms
Reinforcement learning
Input: Markov decision process:
▶ agent & environment states, vis. rules, actions, transition probabilities, rewards
Typical problems:
▶ control learning
[Géron 2017]
Learning paradigms
(source: NVIDIA)
Nearest-Neighbors Search
Outline:
• Problem statement
• k-d trees:
- definition
- construction
Nearest neighbor search
pre-processing input: P ⊂ Rd
query input: q ∈ Rd
output: d(q, P)
Variants:
• metrics: ℓ2, ℓp, ℓ∞, · · ·
• · · ·
Linear scan
Input: P = {p1, · · · , pn} ⊂ Rd, q ∈ Rd
dmin := +∞
for i = 1 to n do:
    dmin := min{dmin, d(q, pi)}
return dmin
(In the following we will usually record only dmin, not the index i that achieves dmin, to simplify the pseudo-code; storing the index is a straightforward extension.)
Complexity: O(dn) time and space
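A minimal Python sketch of the linear scan (illustrative only; names such as linear_scan are mine, not from the course):

import numpy as np

def linear_scan(P, q):
    """Return the smallest Euclidean distance from the query q to the point set P.

    P: (n, d) array of points, q: (d,) query vector.
    """
    d_min = float("inf")
    for p in P:                      # one pass over the n points
        d_min = min(d_min, np.linalg.norm(q - p))
    return d_min

# usage example
P = np.random.rand(1000, 3)
q = np.random.rand(3)
print(linear_scan(P, q))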
Strategy:
▶ preprocess the n points of P in Rd into some data structure DS allowing fast nearest-neighbor query answers
Core difficulties:
▶ curse of dimensionality: it is hard to outperform the linear scan in high dimensions d
▶ concentration of distances in high dimensions [Demartinez’94]
  Interpretation: meaningfulness of distances in high d (concentration)
Popular approaches
• linear scan — this is the baseline: O(dn) space and time
• space partitioning trees (quadtree, k-d tree, RP-tree)
• Voronoi diagrams
▶ ···
kd-tree
Origin of the name: see the original 1977 paper [Friedman, Bentley, Finkel: An algorithm for finding best matches in logarithmic expected time].
• the subdivision stops whenever fewer than n0 points remain; thus each leaf represents a trivial partition (i.e. with a single subset) with ≤ n0 points
⇝ size: O(dn)
(n0 = 1)
kd-tree specifics:
Several variants of the construction exist. The historical one:
▶ at each step, the splitting hyperplane H is orthogonal to a coordinate axis (possible choices: cyclic iteration over the coordinates, or the coordinate of maximum spread)
▶ H goes through the median in the considered coordinate; in particular, the median rule implies that the kd-tree is balanced
(n0 = 1)
Example
(figure: kd-tree subdivision of a planar point set p1, · · · , p11 by splitting lines l1, · · · , l10, together with the corresponding binary tree)
Recursive construction
2 types of nodes: internal nodes (storing a splitting point and coordinate) and leaves (storing a batch of ≤ n0 points)
if |P| ≤ n0: create a leaf storing P
else:
    compute the median m of {p[c] : p ∈ P} and a point p∗ ∈ P such that p∗[c] = m
    split P at p∗ along coordinate c and recurse on both sides
Median computation:
• by sorting the points of the current cloud P
• by linear-time median selection (randomized or deterministic, cf. INF562)
  (the randomized linear median is based on the same idea as QuickSort, with a randomized choice of pivot)
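A compact Python sketch of the recursive construction (cyclic choice of the split coordinate, median split, leaf batches of size ≤ n0; the class and function names are mine):

import numpy as np

class KDNode:
    """Internal node: splitting point and coordinate; leaf: batch of points."""
    def __init__(self, point=None, coord=None, left=None, right=None, batch=None):
        self.point, self.coord = point, coord
        self.left, self.right, self.batch = left, right, batch

def build_kdtree(P, depth=0, n0=1):
    P = np.asarray(P, dtype=float)
    if len(P) == 0:
        return None
    if len(P) <= n0:
        return KDNode(batch=P)                       # leaf with <= n0 points
    c = depth % P.shape[1]                           # cyclic choice of the split coordinate
    order = np.argsort(P[:, c])
    m = len(P) // 2                                  # median along coordinate c
    return KDNode(point=P[order[m]], coord=c,
                  left=build_kdtree(P[order[:m]], depth + 1, n0),
                  right=build_kdtree(P[order[m + 1:]], depth + 1, n0))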
Nearest-neighbor search in the kd-tree — search(node):
if node = leaf:
    dmin := min{dmin, min_{p ∈ node.batch} d(q, p)}
else:
    dmin := min{dmin, d(q, node.point)}
    recurse first into the child whose cell contains q (e.g. search(node.left)), then into the other child only if the ball B(q, dmin) crosses the splitting hyperplane
(Important: on the picture, only the cells of the subdivision intersected by the ball B(q, dmin) are visited.)
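A matching Python sketch of the search with backtracking, reusing the KDNode / build_kdtree sketch above; the test abs(diff) < d_min is my rendering of the “ball crosses the hyperplane” check:

import numpy as np

def kd_search(node, q, d_min=float("inf")):
    if node is None:
        return d_min
    if node.batch is not None:                       # leaf: scan its batch
        for p in node.batch:
            d_min = min(d_min, np.linalg.norm(q - p))
        return d_min
    d_min = min(d_min, np.linalg.norm(q - node.point))
    diff = q[node.coord] - node.point[node.coord]
    near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
    d_min = kd_search(near, q, d_min)                # descend into the side containing q
    if abs(diff) < d_min:                            # backtrack only if B(q, d_min) crosses the hyperplane
        d_min = kd_search(far, q, d_min)
    return d_min

P = np.random.rand(500, 2)
q = np.random.rand(2)
tree = build_kdtree(P)
print(kd_search(tree, q))        # matches the linear scan result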
Example
(figure: the same kd-tree subdivision with a query point q; only the cells intersected by the ball B(q, dmin) are visited during the search)
Example
Expected query time: O(c^d log n) — the constant c^d is exponential in d, with c^d ≈ 2^d
[Friedman et al.: An algorithm for finding best matches in logarithmic expected time, 1977]
⇒ query time = Ω(d n) in the worst case
Benchmarks
avg. query time (µs) vs. # data points (uniform measure in the unit square in 2d)
Note: the lack of linearity of the linear scan is probably due to the asymptotic regime starting pretty late.
Benchmarks
avg. query time (µs) vs. # data points (uniform measure on the unit circle in 2d)
Note: the query point stands at the origin in this experiment. Beware that the Y-scale has changed.
High dimensions
pre-processing input: P ⊂ Rd
query input: q
goal: find p ∈ NNP(q)
Curse of Dimensionality:
Every data structure for NN-search has either exponential size or exponential query time (in d) in the worst case.
→ holds both in theory and in practice [Weber et al. ’98] [Arya et al. ’98]
Benchmarks
avg. query time (µs) vs. dimension (10,000 pts sampled uniformly inside the unit cube)
Note: the inversion point of the linear-scan and backtracking plots lies around dim = 12 → 2 caveats, starting with the implementation details.
• Linear scan
• Curse of dimensionality
Clustering with k-Means
Outline:
• k-means objective
• Lloyd’s algorithm
• Initialization
• Choosing the number k of clusters
Clustering is ill-posed: no ground truth is provided, i.e. no labels on the data to start with, so everything lies in the choice of the clustering criterion.
Task: partition the data points into homogeneous subsets (clusters)
A wealth of approaches
▶ Variational: k-means, k-medoids, EM
▶ Density thresholding: DBSCAN, OPTICS
▶ Mode seeking: Mean/Medoid/Quick Shift, graph-based hill climbing
▶ Spectral: Normalized Cut, Multiway Cut
▶ Valley seeking: [JBD’79], NDDs [ZZZL’07]
▶ Hierarchical (divisive/agglomerative): single-linkage, BIRCH
k-Means
Paradigm: cast clustering into an optimization problem

    min_{c1,··· ,ck, σ}  Var(P, c1, · · · , ck, σ) := (1/n) Σ_{p∈P} ∥p − c_σ(p)∥₂²  =  Σ_{i=1}^{k} (n_i/n) Var(C_i, c_i)

(weighted sum of cluster variances; note: Var(C_i, c_i) denotes the variance of C_i with a particular choice of center c_i)
Prop: argmin_{ci} Var(Ci, ci) = (1/ni) Σ_{p∈Ci} p =: c∗   (the centroid of Ci)

proof: (this proof is simpler and more direct than in the polycopié)
Var(Ci, ci) = (1/ni) Σ_{p∈Ci} ∥p − ci∥₂² = (1/ni) Σ_{p∈Ci} ∥(p − c∗) + (c∗ − ci)∥₂²
            = (1/ni) Σ_{p∈Ci} ∥p − c∗∥₂² + (2/ni) Σ_{p∈Ci} (p − c∗) · (c∗ − ci) + ∥c∗ − ci∥₂²
            = Var(Ci, c∗) + 0 + ∥c∗ − ci∥₂²  ≥  Var(Ci, c∗)   (the cross term vanishes since Σ_{p∈Ci} (p − c∗) = 0)   □
Characterizing the argmin
Fixed centers: c1, · · · , ck

Prop: argmin_σ Var(P, c1, · · · , ck, σ) = p ↦ NN_{c1,··· ,ck}(p) =: σNN

proof: assigning each point to its nearest center minimizes each term ∥p − c_σ(p)∥₂² independently.
The corresponding cells are the Voronoi cells V(ci) := {x ∈ Rd | ∥x − ci∥₂ ≤ ∥x − cj∥₂ ∀j}; each boundary satisfies
    ∥x − ci∥₂² = ∥x − cj∥₂²  ⇐⇒  x² − 2 x·ci + ci² = x² − 2 x·cj + cj²  ⇐⇒  2 x·(cj − ci) + (ci² − cj²) = 0
(an affine equation, so each constraint defines a half-space and V(ci) is convex)   □
Characterizing the argmin
This characterization is not unique: a data point lying on a bisector can be assigned indifferently to either of the two adjacent clusters.

Prop: Every centroidal Voronoi partition (P, c∗1, · · · , c∗k, σNN) such that there are no points of P on the boundaries corresponds to a local minimum of Var(P, c1, · · · , ck, σ).

proof:
• no pts on boundaries ⇒ Var(P, c∗1, · · · , c∗k, σ) > Var(P, c∗1, · · · , c∗k, σNN) ∀σ ̸= σNN
Computation: the easy 1-d case
Now look at an optimal cluster configuration: Voronoi cells are convex ⇒ optimal clusters are contiguous: C1 < C2 < · · · < Ck
(figure: sorted points p1, · · · , pn on the real line, grouped into contiguous clusters C1, · · · , Ck)
Dynamic programming over the sorted points:

    OPT(n, k) = min_{1≤j≤n} { OPT(j − 1, k − 1) + (n + 1 − j) Var({pj, · · · , pn}) }

where (n + 1 − j) Var({pj, · · · , pn}) is the sum of squared distances to the mean within the last cluster, and
    OPT(n, k) = 0 if n = 0, and +∞ if n ̸= 0 = k.

complexity: O(n³ k) naive, or O(n² k) with linear-time aggregation of variances
(requires precomputing the partial sums Σ_{i=j} pi² for all j = 1, · · · , n, taking O(n²) time)
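A short Python sketch of this dynamic program (the function name kmeans_1d and the use of prefix sums are mine; it returns the optimal cost, recovering the clusters would need back-pointers):

import numpy as np

def kmeans_1d(points, k):
    """Exact 1-d k-means by dynamic programming, O(n^2 k) with prefix sums."""
    p = np.sort(np.asarray(points, dtype=float))
    n = len(p)
    s1 = np.concatenate(([0.0], np.cumsum(p)))        # prefix sums of p_i
    s2 = np.concatenate(([0.0], np.cumsum(p ** 2)))   # prefix sums of p_i^2

    def sse(j, m):  # sum of squared distances to the mean of p[j:m] (half-open, 0-based)
        cnt = m - j
        s, sq = s1[m] - s1[j], s2[m] - s2[j]
        return sq - s * s / cnt

    INF = float("inf")
    # opt[m][l] = optimal cost of clustering the first m points into l clusters
    opt = [[INF] * (k + 1) for _ in range(n + 1)]
    opt[0][0] = 0.0
    for l in range(1, k + 1):
        for m in range(l, n + 1):
            opt[m][l] = min(opt[j][l - 1] + sse(j, m) for j in range(l - 1, m))
    return opt[n][k]

print(kmeans_1d([1.0, 1.2, 5.0, 5.1, 9.0], 3))   # 0.025: clusters {1, 1.2}, {5, 5.1}, {9}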
Computation: hard general case
For arbitrary k > 1 and d > 1, the problem is NP-hard:
- for arbitrary d, even when k = 2 [Aloise et al. 2009]
(Note: earlier proofs of NP-hardness existed, but they all turned out to be flawed.)
Lloyd’s algorithm (k-means)
Input: P = {p1, · · · , pn} ⊂ Rd, k ≥ 1
Initialize the centers c1, · · · , ck
Repeat until the assignment no longer changes:
  • assign each point to its nearest center (Voronoi partition σNN)
  • recompute each center ci as the centroid of its cluster
(source: Wikipedia)
→ special instance of the EM algorithm: the clustering model is parametrized by the cluster centers c1, · · · , ck; the E-step computes the assignments, the M-step recomputes the centers.
EM is more general: non-uniform weights, anisotropic Gaussians, soft clustering...
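A minimal Python sketch of these iterations (names such as lloyd_kmeans and the Forgy-style initialization are mine):

import numpy as np

def lloyd_kmeans(P, k, n_iter=100, seed=0):
    """Plain Lloyd iterations, centers initialized among the data points."""
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # E-step: assign each point to its nearest center
        dists = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each center as the centroid of its cluster
        new_centers = np.array([P[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

P = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = lloyd_kmeans(P, k=2)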
proof:
• the total variance decreases strictly at each non-terminal iteration (stopping criterion);
  this shows that the algorithm converges at the limit (decreasing, non-negative energy)
• after each iteration, the lowest variance over all assignments is achieved by the current Voronoi partition σ;
  strict decrease ⇒ each Voronoi partition is considered during ≤ 1 non-terminal iteration
• there are finitely many different Voronoi partitions ⇒ at most O(n^{kd}) iterations
Prop: The algorithm generically converges to a local minimum of the total variance.
(“generically” requires having no data point on Voronoi cell boundaries throughout the iterative process)
Shortcomings:
• depending on the initialization, the local minimum reached can be arbitrarily bad
[MacQueen 1967]
Note: certified (1 + ε)-approx. algos. typically run in time O(2^k ε^{−d} n polylog(n, k, d))
[Har-Peled, Mazumdar 2004]
Initialization
Random centers:
• sampled uniformly among the data points [Forgy 1965]
  (Forgy published the same method as [Lloyd 1957] with this particular initialization)
→ generally speaking, performances highly depend on the input data; Forgy’s method is preferred and performs best among these simple approaches.
c1 := a data point sampled uniformly at random from P;  C := {c1}
for i = 2 to k do:
    sample ci from P with probability  P(p) := d(p, C)² / Σ_{q∈P} d(q, C)²,   where d(x, C) = min_{1≤j<i} ∥x − cj∥₂
    C := C ∪ {ci}
done
Initialization
k-means++ [Arthur, Vassilvitskii 2007]
→ theoretical guarantees on the resulting (initial) total variance; since the guarantee already holds after the k-means++ initialization, it also holds upon termination of Lloyd’s algorithm.
Prop: In expectation, the total variance is within a factor O(log k) of the optimal.
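A short Python sketch of the k-means++ seeding (names are mine; the output centers would then seed Lloyd’s iterations):

import numpy as np

def kmeans_pp_init(P, k, seed=0):
    """k-means++ seeding: each new center is drawn with probability proportional
    to its squared distance to the centers already chosen."""
    rng = np.random.default_rng(seed)
    centers = [P[rng.integers(len(P))]]           # c1: uniform among the data points
    for _ in range(1, k):
        d2 = np.min([np.sum((P - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(P[rng.choice(len(P), p=d2 / d2.sum())])
    return np.array(centers)

P = np.random.rand(200, 2)
init_centers = kmeans_pp_init(P, k=4)    # then run Lloyd's iterations from these centers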
Choosing the number k of clusters
(figure: k = 2 (underfitting) vs. k = 4 (overfitting))
• elbow method: plot the total variance against k = 2, 3, · · · , 10 and pick the value at the “elbow” of the curve (where the arm meets the forearm)
• silhouette:   s(p) := (b(p) − a(p)) / max{a(p), b(p)}  ∈ [−1, 1]
  where a(p) is the mean distance from p to the points of its own cluster and b(p) is the smallest mean distance from p to the points of another cluster
(figure: silhouette values of the n data points, for k = 2, 3, · · · , 10)
What you should know
Outline:
• Single-linkage clustering
Hierarchical clustering
Input: a finite set of n observations: - point cloud with coordinates
- distance / (dis-)similarity matrix
Task:
partition the data points into k homogeneous subsets (clusters)
Hierarchical clustering
Q: what is the number k of clusters? what if there is more than one solution?
k = 1?
k = 2?
k = 4?
k > 4?
Hierarchical clustering
We specify a multiscale hierarchical clustering: hierarchical means that clusters only get merged (and not split) as the scale increases; dually, going down the scale, clusters only get split.
Formally, a hierarchical clustering is a map θ : R+ → Partitions(P) such that:
▶ θ(0) = {singletons(P)}
▶ ∃t0 : ∀t ≥ t0, θ(t) = {P}
▶ ∀t ≤ t′, θ(t) refines θ(t′): ∀C ∈ θ(t), ∃C′ ∈ θ(t′) : C ⊆ C′
(the parameter t is the scale)
Building the hierarchy
Two strategies: agglomerative (merge clusters, bottom-up) vs. divisive (split clusters, top-down).
Combinatorial aspects:
• each merge / division operation adds / subtracts one cluster ⇒ n − 1 steps for each approach (with two-fold merges or splits)
• at step k:
  AHC: n − k clusters, hence (n−k choose 2) choices for the next merge ⇒ average size of the search space: Θ(n²)
  DHC: k clusters of sizes n1, · · · , nk ⇒ Σ_{i=1}^{k} (2^{ni−1} − 1) choices (a. choose the cluster to split; b. choose the split of that cluster), summed over all clusters
⇒ AHC is the most commonly used approach; in this course we only look at AHC, for the reasons just invoked.
Agglomerative hierarchical clustering
Generic algorithm: start from the singleton clusters; while more than one cluster remains, merge the two closest clusters according to a cluster distance δ; done
→ to maintain the clusters, use a union-find data structure (e.g. a disjoint-set forest), as in Kruskal’s algorithm; this aspect is deferred to the algorithmics courses (e.g. INF421)
Common choices of cluster distance δ:
▶ average-linkage:   δAL(C, C′) := (1 / (|C| |C′|)) Σ_{p∈C} Σ_{p′∈C′} d(p, p′)
Impact of the choice of distance δ
Example: the Iris data set (http://archive.ics.uci.edu/ml/datasets/Iris)
▶ n = 150 observations
▶ d = 4 variables: sepal length/width, petal length/width
▶ k = 3 species: virginica, versicolor, setosa
▶ single-linkage: + theoretical stability guarantees (see below); − tends to produce unbalanced trees
▶ complete-linkage: + balanced tree; − sensitive to outliers; − no theoretical guarantees
▶ average-linkage: + tradeoff balancedness/stability; − no theoretical guarantees
Distances based on statistical quantities
Ward’s criterion: a distance δ based on the (weighted) intra-cluster variance; it involves a statistical quantity that connects AHC to k-means (same objective function: the weighted intra-cluster variance is nothing but the sum of squared distances to the means, as in k-means).
▶ the next merge is the one that least increases the (weighted) intra-cluster variance:

    δ(C, C′) := ( |C ∪ C′| Var(C ∪ C′) − (|C| Var(C) + |C′| Var(C′)) ) / |C ∪ C′|
              = (|C| |C′| / (|C| + |C′|)²) ∥E C − E C′∥₂²   in Euclidean space

(roughly, the increase in variance caused by the merge is divided globally by the total size of the joint cluster, i.e. by |C ∪ C′|)
▶ δ({pi}, {pi}) = 0
(figure: example dendrograms on small point sets p1, p2, p3)
Single-linkage clustering

    δ(C, C′) := min_{p∈C, p′∈C′} d(p, p′)

(note: δ increases from 0 to ∞ during the agglomerative process; thus, from now on we abuse the notation and see δ also as a real scale parameter s that increases continuously)

Prop (invariant): For any pi ∈ P and any s ≥ 0, the cluster of pi at scale s is its connected component in the s-neighborhood graph.
(in other words, the clusters are given by the connected components of the neighborhood graph; this invariant underlies both the efficient algorithm and the dendrogram characterization below)
Details / complexity:
- sort the O(n²) edges by length in O(n² log n) time using merge sort
  (pre-sorting is what allows the algorithm to be fast, compared to complete linkage or average linkage)
- iterate over the edges in O(n² α(n)) time using a disjoint-set forest, where α is the inverse Ackermann function
  (the Ackermann function is defined recursively by a double induction, starting with the base case n + 1 if m = 0)
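A Kruskal-style Python sketch of single-linkage (class and function names are mine; a simple union-find without rank, so slightly slower than the bound above):

import numpy as np

class DisjointSet:
    """Minimal union-find (disjoint-set forest) with path compression."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i
    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return False
        self.parent[ri] = rj
        return True

def single_linkage(P):
    """Return the n-1 merges (scale, i, j) in increasing order of scale."""
    n = len(P)
    edges = sorted((np.linalg.norm(P[i] - P[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    ds, merges = DisjointSet(n), []
    for s, i, j in edges:
        if ds.union(i, j):               # merge the two connected components at scale s
            merges.append((s, i, j))
    return merges                        # the single-linkage dendrogram heights

P = np.random.rand(10, 2)
print(single_linkage(P)[:3])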
Example
When δ = s, the s-neighborhood graph is the intersection graph of the balls of radius s/2 centered at the data points.
(figure: point cloud, balls of radius s/2, and the corresponding dendrogram over the scale range δ ∈ [0, 16])
For two points pi, pj:
    height of their least common ancestor (LCA) in the dendrogram = smallest s for which Cs(pi) = Cs(pj)
(the equality holds by definition of the dendrogram; by our previous invariant, it is also the smallest s at which pi and pj lie in the same connected component of the s-neighborhood graph)
Connection to ultrametrics and stability
The height dLCA(pi, pj) of the LCA of pi and pj in the dendrogram defines an ultrametric on P.
• identity: d(pi, pj) = 0 =⇒ pi = pj
proof: (illustrated on three points pk, pi, pj)
This connection makes it possible to prove formal properties of single-linkage clustering.
Thm: [Carlsson, Mémoli 2010]
This result shows that dLCA is stable under small perturbations of the ground metric, i.e. of the input data.
https://www.enseignement.polytechnique.fr/informatique/INF631/
Process:
• run average-linkage clustering on M ⇝ dendrogram θ_AL^d
• scale θ_AL^d by a factor of 1/2 ⇝ dendrogram θ_AL^′d
• Stability result
Outline:
• Mathematical formulation
• Parametric estimators:
- Gaussian model
- Gaussian mixture models (GMMs)
• Non-parametric estimators:
- histograms
- kernel density estimators
Mathematical formulation
Input: Pn = {p1, · · · , pn} ⊂ Rd
Prior: pi ∼iid ν for some unknown probability measure ν with density f : Rd → R
(assumption: existence of a probability measure with a density underlying the data)
Goal: build an estimator fˆn : Rd → R of f from the sample Pn
Sample applications:
• noise filtering
• outlier detection / removal
• density estimation for downstream tasks
Quality of a density estimator
Note: For a fixed x ∈ Rd, the estimate fˆn(x) is itself a random variable in R.
→ its mean/expectation E_Pn fˆn(x) and variance Var_Pn fˆn(x) are taken with respect to the sampling of the points p1, · · · , pn → the estimator oscillates around its mean.
(figure: the true density f in blue, the mean E_Pn fˆn of the estimator in solid red, and the band E_Pn fˆn ± √(Var_Pn fˆn))

Bias: Bias_Pn(fˆn(x)) := E_Pn fˆn(x) − f(x)
(depending on the choice of Pn, the estimator fˆn(x) oscillates around its mean; it is unbiased when this mean coincides with f(x))

Mean Squared Error (MSE): MSE_Pn(fˆn(x)) := E_Pn (fˆn(x) − f(x))²
(in French: erreur quadratique moyenne, or L2 error, over the set of possible samplings of n points)

Thm: MSE_Pn(fˆn(x)) = Bias_Pn(fˆn(x))² + Var_Pn(fˆn(x))
Proof sketch: decompose fˆn(x) − f(x) = (fˆn(x) − E_Pn fˆn(x)) + (E_Pn fˆn(x) − f(x)), expand the square and note that the cross term has zero expectation. □

Convergence (consistency): plim_{n→∞} fˆn(x) = f(x)
(convergence in probability; note that asking that lim_{n→∞} P_Pn(fˆn(x) = f(x)) = 1 would be too strong)

Prop: MSE_Pn(fˆn(x)) −→_{n→∞} 0 =⇒ fˆn(x) is consistent
Proof sketch: let Xn = fˆn(x) − f(x). Using Chebyshev’s inequality, convergence in L2 (i.e. E Xn² −→ 0) implies convergence in probability:
E Xn² −→ 0 =⇒ Var Xn + (E Xn)² −→ 0 =⇒ Var Xn −→ 0 and |E Xn| −→ 0.
Given ε > 0, ∃N ∈ N s.t. ∀n > N, |E Xn| ≤ ε/2,
=⇒ ∀n > N, P(|Xn| ≥ ε) ≤ P(|Xn − E Xn| ≥ ε/2) ≤ (4/ε²) Var Xn −→ 0.   (Chebyshev) □
(see e.g. Lemma 2.2.2 of Durrett’s Probability: Theory and Examples, 4th edition)

Robustness: the expectation, variance, bias and MSE of the estimator are not perturbed too much by adding outliers (e.g. consistency and asymptotic unbiasedness are preserved).
Outline
• Mathematical formulation
• Parametric estimators → examples: Normal/Gaussian, Poisson, exponential families, etc.; mixture models
  - Gaussian model
  - Gaussian mixture models (GMMs)
• Non-parametric estimators → examples: histograms, kernel density estimators, k-NN estimator
  - histograms
  - kernel density estimators
parametric
Gaussian model
(see e.g. Yen-Chi Chen’s course notes for STAT 425, Lecture 6)
Univariate case (d = 1): p1, · · · , pn ∼iid ν = N(µ, σ²)
→ approach: compute estimates µ̂n and σ̂n² of the mean and variance, then define:

    fˆn(x) := (1 / √(2π σ̂n²)) exp( −(x − µ̂n)² / (2 σ̂n²) )

→ consistent (continuous mapping theorem: if µ̂n and σ̂n² converge in probability to µ and σ², then so does any continuous function of them)
Estimators:
• empirical mean: µ̂n := (1/n) Σ_{i=1}^n pi → unbiased, consistent (law of large numbers)
• empirical variance: σ̂n² := (1/n) Σ_{i=1}^n (pi − µ̂n)² → negatively biased, consistent
  (indeed, a calculation shows that the expected empirical variance is always smaller than σ²)
• corrected empirical variance: σ̂n² := (1/(n−1)) Σ_{i=1}^n (pi − µ̂n)² → unbiased, consistent (Cochran’s theorem)
parametric
Gaussian model — rates of convergence (univariate case):
• empirical mean: |µ̂n − µ| = O_Pn(1/√n)   (Berry–Esseen theorem)
  (Note: O_Pn denotes a stochastic bound: ∀ε > 0 ∃Δ, N > 0 s.t. P_Pn(|µ̂n − µ| √n > Δ) < ε, ∀n > N; Δ may depend on ε and be arbitrarily large, possibly diverging to infinity)
• empirical variance (corrected or not): |σ̂n² − σ²| = O_Pn(1/√n)
• density estimator: |fˆn(x) − f(x)| = O_Pn(1/√n)   (follows from the above bounds via some calculation)
parametric
Gaussian model — non-Gaussian case: (ν has mean µ and variance σ² but ν ̸= N(µ, σ²))
• fˆn(x) still converges to f̄(x) := (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)) at rate 1/√n
  (because the empirical mean and variance still converge to the true mean and variance)
• |f̄(x) − f(x)| ̸= 0   (constant bias)
  (this bias is constant, i.e. independent of the sampling, therefore it cannot be reduced by taking more samples)
parametric
Gaussian model
Multivariate case (d > 1): p1, · · · , pn ∼iid ν = Nd(µ, Σ), where Σ is a d × d covariance matrix
Estimators:
• empirical mean: µ̂n := (1/n) Σ_{i=1}^n pi → unbiased, consistent
  (because the d-dimensional empirical mean is the vector of coordinate-wise 1-dimensional empirical means)
• empirical covariance: Σ̂n := (1/n) Σ_{i=1}^n (pi − µ̂n)(pi − µ̂n)^T → biased, consistent
• corrected empirical covariance: Σ̂n := (1/(n−1)) Σ_{i=1}^n (pi − µ̂n)(pi − µ̂n)^T → unbiased, consistent
Rates of convergence:
• empirical mean: ∥µ̂n − µ∥₂ = O_Pn(√(d/n))   (the mean is defined coordinate-wise, but a different argument applies for the joint bound)
• empirical covariance (corrected or not): ∥Σ̂n − Σ∥ = O_Pn(√(d/n)), where ∥·∥ is the operator norm, i.e. ∥M∥ = sup_{v̸=0} ∥Mv∥ / ∥v∥
• density estimator: |fˆn(x) − f(x)| = O_Pn(√(d/n))   (basically the same analysis as in the 1-d case, except for the influence of the dimension)
Non-Gaussian case: (ν has mean µ and covariance matrix Σ but ν ̸= N(µ, Σ))
• fˆn(x) still converges to f̄(x) := (1/√((2π)^d det Σ)) exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ)) at rate √(d/n)
parametric
Gaussian mixture models (GMMs)
Target density: a mixture density f̄(x) = Σ_{l=1}^r ξ_l ϕ_l(x), where the weights ξ_l ≥ 0 sum to 1 and each ϕ_l is a Gaussian density
(this so-called “mixture” density is by definition a convex combination of Gaussian densities — we will see later why it is called that)
Underlying generative model: p1, · · · , pn ∼iid Σ_{l=1}^r ξ_l Nd(µ_l, Σ_l)   (GMM)

Approach: for a fixed r, estimate the weights ξ_l and parameters µ_l, Σ_l, then define:

    fˆn(x) := Σ_{l=1}^r ξ̂_l Φ_{µ̂_l, Σ̂_l}(x)

(this estimator aims at approximating f̄)
Maximum likelihood estimation (MLE): (we work under the hypothesis that the data do come from a mixture model)

    (ξ̂_l, µ̂_l, Σ̂_l)_{l=1}^r := argmax_{(ξ_l, µ_l, Σ_l)_{l=1}^r}  Σ_{i=1}^n log( Σ_{l=1}^r ξ_l Φ_{µ_l, Σ_l}(pi) )
    (log-likelihood of (pi)_{i=1}^n given (ξ_l, µ_l, Σ_l)_{l=1}^r)

▶ likelihood of parameters (ξ_l, µ_l, Σ_l)_{l=1}^r := probability of observing the sample Pn given the choice of parameters:
    L(pi; (ξ_l, µ_l, Σ_l)_{l=1}^r) = Σ_{l=1}^r ξ_l Φ_{µ_l, Σ_l}(pi)   (mixture density)
    L((pi)_{i=1}^n; (ξ_l, µ_l, Σ_l)_{l=1}^r) = Π_{i=1}^n L(pi; (ξ_l, µ_l, Σ_l)_{l=1}^r)   (independence)

▶ consistency and good asymptotic behavior w.r.t. f̄: asymptotic unbiasedness, achieves the Cramér–Rao lower bound (a lower bound on the variance of unbiased estimators)
▶ no closed-form solution
▶ variational solvers (gradient ascent, Expectation–Maximization) quickly become costly (e.g. EM with 50 Gaussians is the Holy Grail) and/or get stuck
▶ non-concave functional ⇒ local maxima, non-unique global maximum
▶ choice of mixture size r: large bias (small r) vs. large variance (large r)
  (indeed, the larger r, the larger the number of parameters in the model and estimator, hence the larger the variance)
nonparametric
Histograms
Uniform grid with cells of side 1/N; Cell(x) denotes the cell containing x; the estimator is
    fˆn(x) := (N^d / n) #{i : pi ∈ Cell(x)}
(hypothesis: f Lipschitz-continuous)
Bias:
    E_Pn fˆn(x) = (N^d / n) Σ_{i=1}^n P(pi ∈ Cell(x)) = N^d P(p1 ∈ Cell(x))     (linearity of expectation)
                = N^d ∫_{Cell(x)} f(u) du
                = f(x∗) for some x∗ ∈ Cell(x)
(the continuous function f attains its minimum f(y) and maximum f(z) on the compact cell Cell(x), so the mean value lies in [f(y), f(z)]; by the intermediate value theorem along any path — in fact the straight-line segment — between y and z in the path-connected cell, this value is attained at some x∗)

    |Bias fˆn(x)| = |E_Pn fˆn(x) − f(x)| = |f(x∗) − f(x)| ≤ Lip_f √d / N

Variance:
    Var_Pn fˆn(x) = (N^{2d} / n) ( P(p1 ∈ Cell(x)) − P(p1 ∈ Cell(x))² ) = (N^d / n) ( f(x∗) − (1/N^d) f(x∗)² )
(this comes from the variance of the count of points falling in Cell(x); note that f(x∗) depends only on the true density f, not on the grid step, and that the variance increases with N)

Mean squared error:
    MSE_Pn fˆn(x) ≤ Lip_f² d / N² + N^d f(x∗) / n
The optimal N annihilates the derivative of the bound:   N_OPT = ( 2n Lip_f² / f(x∗) )^{1/(d+2)}
    ⇒  MSE_OPT fˆn(x) = O( n^{−2/(d+2)} )
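A minimal Python sketch of the histogram estimator on the unit cube (names, the unit-cube domain and the sparse dictionary of cells are my choices):

import numpy as np

def histogram_density(sample, N):
    """Histogram density estimator on [0, 1]^d with N bins per axis.

    Returns a function x -> estimated density, where d = sample.shape[1]."""
    n, d = sample.shape
    counts = {}
    for p in sample:
        cell = tuple(np.minimum((p * N).astype(int), N - 1))   # index of the cell containing p
        counts[cell] = counts.get(cell, 0) + 1
    def f_hat(x):
        cell = tuple(np.minimum((np.asarray(x) * N).astype(int), N - 1))
        return (N ** d / n) * counts.get(cell, 0)
    return f_hat

sample = np.random.rand(5000, 2)        # uniform on the unit square -> true density = 1
f_hat = histogram_density(sample, N=10)
print(f_hat([0.5, 0.5]))                # ≈ 1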
nonparametric
Histograms — limitations:
▶ the grid is chosen a priori, without knowledge of ν; it does not adapt to the shape of the support of the density
▶ the tessellation can become costly to maintain as d increases (it actually depends on what you do with the density: for mere pointwise evaluation, only the non-empty cells need to be stored)
nonparametric
Kernel density estimators
Kernel-based estimators have been designed to adapt naturally to the shape of the support of the density.
General formula (convolution of the empirical measure with the density K_H):

    fˆn(x) := (1/n) Σ_{i=1}^n K_H(x − pi),   where K_H(u) := (det H)^{−1/2} K(H^{−1/2} u)

(H is the bandwidth / window matrix)
Normalization of the kernel: ∫_{Rd} K(u) du = ∫_{Rd} c_{k,d} k(∥u∥₂²) du = 1, i.e. the normalizing factor is c_{k,d} := ( ∫_{Rd} k(∥u∥₂²) du )^{−1}
With an isotropic bandwidth H = σ² Id:

    fˆn(x) := (c_{k,d} / (n σ^d)) Σ_{i=1}^n k( ∥x − pi∥₂² / σ² )
nonparametric
Common kernels
• Flat / Uniform:  kU(t) := 1 if t ≤ 1, 0 if t > 1   ⇝ c_{k,d} = 1 / Vol Bd(0, 1) = Γ(d/2 + 1) / π^{d/2}
• Epanechnikov:  kE(t) := 1 − t if t ≤ 1, 0 if t > 1   ⇝ c_{k,d} = (d + 2) / (2 Vol Bd(0, 1))
• Gaussian:  kN(t) := exp(−t/2)   ⇝ c_{k,d} = (2π)^{−d/2}
(figures: the profiles kU, kE, kN and the corresponding kernels KU, KE, KN)
(figure: kernel density estimates of the same 1-d sample, for bandwidths σ = 1, 3, 10)
nonparametric
Convergence rates
Radially-symmetric Gaussian kernel in Rd (we show the Gaussian case because its behavior is typical of a KDE):

    fˆn(x) := (1 / ((2π)^{d/2} n σ^d)) Σ_{i=1}^n exp( −∥x − pi∥₂² / (2σ²) )

Bias:      E_Pn fˆn(x) − f(x) = O(σ²)        (decreases as σ → 0)
Variance:  Var_Pn fˆn(x) = O( 1 / (n σ^d) )  (increases as σ → 0)
Mean squared error:  MSE_Pn fˆn(x) = O( σ⁴ + 1 / (n σ^d) )
    σ_OPT = n^{−1/(d+4)}   ⇒   MSE_OPT fˆn(x) = O( n^{−4/(d+4)} )
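A minimal Python sketch of this Gaussian KDE (names are mine; the bandwidth follows the σ_OPT rate above):

import numpy as np

def gaussian_kde(sample, sigma):
    """KDE with a radially-symmetric Gaussian kernel of bandwidth sigma.

    sample: (n, d) array.  Returns a function x -> estimated density at x."""
    n, d = sample.shape
    norm = (2 * np.pi) ** (d / 2) * n * sigma ** d
    def f_hat(x):
        sq_dists = np.sum((sample - np.asarray(x)) ** 2, axis=1)
        return np.exp(-sq_dists / (2 * sigma ** 2)).sum() / norm
    return f_hat

sample = np.random.randn(2000, 2)                            # standard 2-d Gaussian
f_hat = gaussian_kde(sample, sigma=2000 ** (-1 / (2 + 4)))   # sigma_OPT = n^{-1/(d+4)}
print(f_hat([0.0, 0.0]))                                     # true density at 0 is 1/(2π) ≈ 0.159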
Summary (pointwise estimation error)
• parametric (e.g. Gaussian): O_Pn(1/√n) — parameters of the chosen model
• Histogram: O(n^{−1/(d+2)}) — parameter: number N of bins; curse of dimensionality + computation cost
• Kernel density: O(n^{−2/(d+4)}) — parameter: bandwidth σ; curse of dimensionality
What you should know
Outline:
• k-NN prediction (regression & classification)
• Cross-validation
• Evaluating a classifier’s performance
Supervised learning
Input: n observations + responses (x1, y1), · · · , (xn, yn) ∈ X × Y
▶ regression: Y continuous (e.g. X = R, Y = R)
▶ classification: Y discrete (e.g. X = {images}, Y = {labels})
Statistical framework
Hyp: the observations are drawn from a random variable X, the responses from a random variable Y:
    xi ∼iid X with values in X = Rd
    yi ∼iid Y with values in Y = R (regression) or Y = {1, · · · , κ} (classification)
Prediction error is measured by a loss function L : Y × Y → R, evaluated at (yi, f(xi))
(figure: common loss functions, e.g. absolute and Huber losses)
Regression with squared error
→ decompose the joint probability P(X, Y) = P(Y | X) P(X)
→ the values of f can be set pointwise (i.e. independently for each value x of X), so we minimize the risk pointwise (no regularity assumption on f for now):

    f∗(x) := argmin_{y∈Y} E_{(Y|X)} [ (Y − y)² | X = x ]

(our guess for f(x) is the value y that minimizes the expected error, conditioned on X = x)
→ minimizer:  f∗(x) = E_{(Y|X)} [ Y | X = x ]   (the regression function)
(this is where the choice of the square loss comes into play; the expression above is the Fréchet mean of the conditional distribution)
Note: we have no control over the regularity of f∗; it is prescribed by the distribution of (X, Y).
k-NN regression
Hyp: xi ∼iid X with values in X = Rd, yi ∼iid Y with values in Y = R (regression), with unknown probability distributions
Target: f∗(x) = E_{(Y|X)} [ Y | X = x ]
Regression estimator:

    fˆn,k(x) := (1/k) Σ_{xi ∈ NNk(x)} yi

(variant: responses weighted by inverse distances to x; more generally, one can choose any set of non-negative weights)

Thm (universal consistency) [Stone 1977] [Devroye 1982]:
Suppose Y is a bounded random variable. Then the estimator fˆn,k is consistent if and only if the choice of k = k(n) satisfies k → ∞ and k/n → 0 as n → ∞.
(bibliographic note: Stone proved sufficiency, Devroye refined the result and proved necessity)
(“bounded” means that there is some constant η ≥ 0 such that P(|Y| > η) = 0)
(notice that the result holds in any ambient dimension d, which is not involved in the bounds)
Note: fˆn,k is consistent if: ∀x ∈ X, plim_{n→∞} fˆn,k(x) = f∗(x), i.e. the predictor converges to the best theoretical predictor f∗ (the regression function) in probability.
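A minimal Python sketch of the k-NN regression estimator (names and the sine toy data are mine):

import numpy as np

def knn_regress(X_train, y_train, x, k):
    """k-NN regression estimate at a single query x: average of the responses
    of the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nn_idx = np.argsort(dists)[:k]
    return y_train[nn_idx].mean()
    # for k-NN classification, replace the mean by a majority vote over y_train[nn_idx]

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 2 * np.pi, size=(500, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(500)
print(knn_regress(X_train, y_train, np.array([np.pi / 2]), k=20))   # ≈ 1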
Classification with 0-1 loss
Hyp: xi ∼iid X with values in X = Rd, yi ∼iid Y with values in Y = {1, · · · , κ} (classification)

    f∗(x) = argmin_{y∈{1,··· ,κ}} Σ_{r ̸= y} P(Y = r | X = x)
          = argmin_{y∈{1,··· ,κ}} 1 − P(Y = y | X = x)
          = argmax_{y∈{1,··· ,κ}} P(Y = y | X = x)

(the sum over r is just the expression of the expected 0-1 loss, Y being a categorical variable; the first step is where the choice of the 0-1 loss comes into play)
(the best prediction at x maximizes the posterior probability P(Y | X): this is the Bayes classifier)
⇒ the Bayes error rate R(f∗) is zero when X, Y are perfectly dependent

k-NN classification
    f∗(x) = argmax_{y∈{1,··· ,κ}} P(Y = y | X = x)
is estimated by conditioning on the k-NNs of x (since P(∃ sample exactly at x) = 0): the argmax is determined by a majority vote among the responses of the k nearest neighbors of x.
• ease of implementation
  ▶ k-NN also yields a density estimator:  fˆn,k(x) := k / ( n Vd ∥x − NNk(x)∥₂^d )
    (the denominator is n times the volume of the ball whose radius is the distance from x to its k-th nearest neighbor; Vd = Vol Bd(0, 1))
• slow convergence in high dimensions (curse of dimensionality): the gap between the k-NN classifier’s error rate and the (optimal) Bayes error rate converges slowly
  ▶ the asymptotic regime is often not attained in practice ⇝ need to select k
Cross-validation
One solution to cope with the issue of selecting the hyperparameter k is cross-validation.
Principle:
▶ explore the hyperparameter space or a subset thereof (e.g. via sampling)
  (for k-NN we threshold at an upper limit value of k; we can also subsample the range of values)
▶ train the predictor with these hyperparameter values and compare the results
  (for k-NN the training phase is trivial, but the testing phase is costly)
In practice:
▶ partition the initial dataset into two disjoint subsets: T (training) and V (validation/test)
  (this is a very important aspect: T and V must be disjoint, because it is the prediction power on unseen data that is measured; when a predictor is then evaluated experimentally, yet another part of the data must be kept aside)
▶ do the training on T, then test the hyperparameter values on V
▶ average the performance over some subset of all partitions T ⊔ V
  (averaging is important, to make the result independent of the choice of partition)
Examples of methods:
▶ Exhaustive cross-validation: a (sub)space of partitions is entirely explored
▶ Non-exhaustive cross-validation:
  • holdout: use a single random partition T ⊔ V (each point assigned independently); large bias here, due to the use of a single partition
  • Monte-Carlo: repeatedly use random partitions T ⊔ V
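A small Python sketch of holdout selection of k for k-NN regression (names, split fraction and candidate values are mine):

import numpy as np

def holdout_select_k(X, y, k_values, test_frac=0.3, seed=0):
    """Holdout cross-validation: split once into train/validation, then pick the k
    with the smallest validation mean squared error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_tr, y_tr, X_te, y_te = X[train_idx], y[train_idx], X[test_idx], y[test_idx]

    def predict(x, k):
        nn = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
        return y_tr[nn].mean()

    errors = {k: np.mean([(predict(x, k) - t) ** 2 for x, t in zip(X_te, y_te)])
              for k in k_values}
    return min(errors, key=errors.get), errors

X = np.random.uniform(0, 2 * np.pi, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(300)
best_k, errs = holdout_select_k(X, y, k_values=[1, 5, 10, 20, 50])
print(best_k)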
Evaluating a classifier’s performance
Given a test set V = {x′1, · · · , x′m} and known responses {y′1, · · · , y′m}:
• error rate: τerr := (1/m) #{misclassified points}
• accuracy: τacc := 1 − τerr = (1/m) #{correctly classified points}
  (called “taux de succès” — and not “précision” — in French)
  ▶ biased when the classes have significantly different sizes (e.g. 1/10⁶ for sick vs. healthy): this performance measure privileges the most common class
• confusion matrix: Ci,j := (1/m) #{points of class j predicted as being in class i}
  (each column represents a true class, each row represents a prediction; this representation is the richest one)
  (figure: κ × κ matrix whose colors, blue → red, represent the fraction of each true class assigned to each prediction)
  ▶ true positives (TP) for class i: points of this class correctly predicted in i
  ▶ false positives (FP) for class i: points of other classes incorrectly predicted in i
  ▶ true negatives (TN) for class i: points of j ̸= i predicted in l ̸= i (possibly with l ̸= j)
  ▶ false negatives (FN) for class i: points of class i incorrectly predicted in some l ̸= i
• precision / positive predicted value:  PPV := TP / (TP + FP)
  (measures the reliability of positive predictions: fraction of true positives among the positive predictions)
• recall / sensitivity / true positive rate:  TPR := TP / (TP + FN)
  (measures the ability to capture the positive instances: fraction of positive predictions among the positives)
• fall-out / false positive rate:  FPR := FP / (TN + FP)
  (measures the tendency to predict negatives as positives: fraction of positive predictions among the negatives)
  (these measures are defined for binary classification, or one-vs.-all in the multi-class case; they are asymmetric, i.e. dependent on the class considered)
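A tiny Python sketch computing these quantities for one class treated as “positive” (names are mine):

import numpy as np

def binary_metrics(y_true, y_pred, positive=1):
    """Precision (PPV), recall (TPR) and fall-out (FPR) for the given positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    ppv = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (tn + fp) if tn + fp else 0.0
    return ppv, tpr, fpr

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))   # (0.75, 0.75, 0.25)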
Evaluating a classifier’s performance
Given a test set V = {x′1, · · · , x′m} and known responses {y′1, · · · , y′m}:
• F-score:  FS := 2 / (1/PPV + 1/TPR) = 2 · PPV · TPR / (PPV + TPR)   (harmonic mean of precision & recall)
  ▶ privileges classifiers with high precision and recall, i.e. both the ability to capture positives and the reliability of positive predictions; biased towards positives
• receiver operating characteristic (ROC) curve:
  (for a classifier that estimates the posterior probability of class +1 and then chooses the label +1 above a varying threshold)
  ▶ plots the classifier’s recall (TPR) versus its fall-out (FPR)
  ▶ the perfect classifier has FP = FN = 0, hence TPR = 1 and FPR = 0
  ▶ the random classifier (which labels each point independently as +1 or −1 according to a random variable) has TPR = FPR (the diagonal)
  (if the classifier’s curve falls under the diagonal, then it is worse than the random classifier)
  ▶ AUC: area under the ROC curve, AUC ≤ 1
(figure: ROC curves of a perfect classifier, a typical classifier and the random classifier, with the AUC shaded)
What you should know
Outline:
• Linear model for regression, ordinary least squares (OLS)
• Optimality, evaluation in practice
• Degenerate settings, regularization (ridge, lasso, elastic net)
• Non-linear regression using kernels
Statistical framework (regression: Y continuous, e.g. X = R, Y = R)
Hyp: the observations are drawn from a random variable X, the responses from a random variable Y:
    xi ∼iid X with values in X = Rd
    yi ∼iid Y with values in Y = R (regression)
(figure: common loss functions, e.g. MAE and Huber)
Regression with squared error (recap)
Hyp: xi ∼iid X with values in X = Rd, yi ∼iid Y with values in Y = R (regression)
→ decompose the joint probability P(X, Y) = P(Y | X) P(X) and minimize the risk pointwise:
    f∗(x) := argmin_{y∈Y} E_{(Y|X)} [ (Y − y)² | X = x ]
→ minimizer:  f∗(x) = E_{(Y|X)} [ Y | X = x ]
(the best prediction of Y at point X = x is the conditional mean — the pointwise, i.e. vertical, mean on the figure)
We now take a parametric approach: we assume the underlying predictor f belongs to the family

    Y = β0 + Σ_{j=1}^d Xj βj + ε
Linear model for regression
Hyp: Y depends linearly on X, plus some independent noise (the noise variable ε is assumed to be independent from the variable X):

    Y = β0 + Σ_{j=1}^d Xj βj + ε = [ 1 X^T ] β + ε,   where β = (β0, β1, · · · , βd)^T ∈ R^{d+1}

Linear predictor:  fβ̂(x) := [ 1 x^T ] β̂
(note that we do not need an estimate of the noise ε in our predictor, since the noise is averaged out in the conditional mean)
→ estimate β by minimizing the empirical risk with the MSE:

    β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − fβ(xi))² = argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xi^T ] β)²

(note again that we only need to estimate β in order to define our linear predictor; we choose the β whose corresponding linear predictor fβ minimizes the MSE, or equivalently the residual sum of squares RSS(β) := Σ_{i=1}^n (yi − [ 1 xi^T ] β)²)
Ordinary least squares (OLS) estimator
Gradient and Hessian of RSS at β (X is the n × (d+1) design matrix with rows [ 1 xi^T ], y = (y1, · · · , yn)^T):
    ∇β RSS(β) = −2 X^T (y − X β)
    ∇²β RSS(β) = 2 X^T X   (positive semi-definite ⇒ the functional is convex, though possibly not strictly)
⇒ the minimizers are the critical points, i.e. they satisfy X^T (y − X β) = 0
Nondegeneracy assumption: the matrix X has full column rank
⇒ 2 X^T X is positive definite
⇒ β̂ = (X^T X)^{−1} X^T y is the unique minimizer
▶ predictor: fβ̂(x) = [ 1 x^T ] β̂ = [ 1 x^T ] (X^T X)^{−1} X^T y
▶ fitted values:  ∀i, ŷi := fβ̂(xi) = [ 1 xi^T ] β̂,  i.e.  ŷ := X β̂ = X (X^T X)^{−1} X^T y
  (ŷ is the orthogonal projection of y ∈ R^n onto col X = ⟨v0, v1, · · · , vd⟩, the linear subspace spanned by the columns of X, i.e. by the input variables; here v0 ≡ 1)
Optimality
Assumptions:
    (x1, y1), · · · , (xn, yn) ∼iid (X, Y) taking values in Rd × R (regression)
    Y = [ 1 X^T ] β + ε   (linear model)
• the OLS estimator β̂ is unbiased:  E_{(X,y)} [β̂] = β
  (the expectation is taken over all possible samplings (x1, y1), · · · , (xn, yn), and computed component-wise)
• the OLS estimator minimizes the MSE among all linear unbiased estimators:

    β̂ ∈ argmin_{β̃ linear unbiased}  E_{(X,y)} ∥β̃ − β∥₂²

  ⇒ β̂ is a best linear unbiased estimator (BLUE)
  (recalling the bias–variance decomposition theorem from Lecture 5, this result implies that, among unbiased estimators, minimizing the MSE amounts to minimizing the variance; the connection between MSE and variance comes from the assumption that the estimators are unbiased)
Evaluation in practice
Given a data set with responses (x1, y1), · · · , (xn, yn), run a linear regression; let ȳ := (1/n) Σ_{i=1}^n yi be the empirical mean response and ŷ := X β̂ the vector of predictions. We can then see how much of the variance of the responses is explained by the fit:
• total sum of squares:     TSS := Σ_{i=1}^n (yi − ȳ)² = ∥y − ȳ 1∥₂²
  (up to a factor of n, TSS is the empirical variance of Y; here ȳ 1 denotes the constant vector with all coordinates equal to ȳ)
• explained sum of squares: ESS := Σ_{i=1}^n (ŷi − ȳ)² = ∥ŷ − ȳ 1∥₂²
  (interpreted, up to a factor of n, as the variance “explained” by the model around the mean)
• residual sum of squares:  RSS := Σ_{i=1}^n (ŷi − yi)² = ∥ŷ − y∥₂²
  (we have seen RSS already when minimizing the MSE: it is the squared distance between the predictions and the responses)
Since ŷ is the orthogonal projection of y onto col X, which contains ȳ 1, Pythagoras gives TSS = ESS + RSS.
(figure: right triangle with legs √ESS and √RSS and hypotenuse √TSS, drawn relative to the subspace spanned by 1, v1, · · · , vd)
▶ coefficient of determination:  R² := ESS / TSS = 1 − FVU ∈ [0, 1]
  (the notation R² reflects the fact that, in the least-squares model, it is the square of the sample correlation; FVU = RSS/TSS is the fraction of variance unexplained)
Degenerate settings
Q: what if the matrix X does not have full column rank?
(this means that the d + 1 columns are not linearly independent; it happens typically with perfectly correlated variables or when n < d)
▶ the minimizers β̂ then form an affine subspace (the RSS functional is constant along the directions of the kernel of X)
▶ one choice of β̂ is more natural: the one within the subspace ⟨x1, · · · , xn⟩ (a.k.a. the one with smallest norm)
(figure: the observations xi — circled dots — lie in a horizontal plane for d = 2, and the affine subspace of minimizers β̂ is shown)
▶ alternative solution: regularized linear regression
(this approach does not require a preprocessing step, as it naturally biases β̂ towards small-norm solutions)
• ridge:       β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xi^T ] β)² + λ ∥β∥₂²
  (easy to optimize; the ℓ2 penalty tends to spread the coefficients over the input variables, i.e. to keep full support)
• lasso:       β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xi^T ] β)² + λ ∥β∥₁
  (harder to optimize — non-differentiable; the ℓ1 penalty tends to concentrate the coefficients on few input variables, i.e. to produce sparse solutions)
• elastic net: β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xi^T ] β)² + λ ( α ∥β∥₂² + (1 − α) ∥β∥₁ )
  (trade-off between the previous two: for small norm values the ℓ1 term dominates, therefore small coefficients are driven to zero)
Ridge regression

    β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xi^T ] β)² + λ ∥β∥₂²

▶ gradient and Hessian (with λ absorbing the 1/n factor):
    ∇ · (β) = −2 X^T (y − X β) + 2λ β
    ∇² · (β) = 2 X^T X + 2λ I_{d+1}   — positive definite for any λ > 0 (the first term is positive semi-definite)
  ⇒ strictly convex functional
▶ β̂ = (X^T X + λ I_{d+1})^{−1} X^T y
▶ algorithms: LU decomposition (works on any invertible matrix, decomposing it into a product of lower- and upper-triangular factors), Cholesky decomposition
  — O(n³) by Gaussian elimination, O(n^ω) by divide-and-conquer
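A short Python sketch of the closed-form ridge estimator (names are mine; the example uses perfectly correlated columns, where plain OLS is degenerate):

import numpy as np

def ridge_fit(X_raw, y, lam):
    """Ridge regression in closed form: beta = (X^T X + lam*I)^(-1) X^T y."""
    n = len(y)
    X = np.hstack([np.ones((n, 1)), X_raw])
    d1 = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d1), X.T @ y)

rng = np.random.default_rng(0)
X_raw = rng.standard_normal((100, 5))
X_raw[:, 4] = X_raw[:, 3]                       # perfectly correlated columns
y = X_raw[:, 0] + 0.1 * rng.standard_normal(100)
print(ridge_fit(X_raw, y, lam=1.0))             # regularization makes the system well-posed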
▶ solve linear regression with transformed variables (basis functions) X → X′:
Def: A Hilbert function space H ⊂ R^{Rd} (i.e. a subspace of functions on Rd) is a reproducing kernel Hilbert space (RKHS) if ∃ Φ : Rd → H s.t.:

∀x ∈ Rd, ∀f ∈ H,  f(x) = ⟨f, Φ(x)⟩_H   (reproducing property)

In particular, H contains the functions kx = k(x, ·).

Terminology:
• feature space H, feature map Φ
• feature vectors Φ(x)
• kernel k := ⟨Φ(·), Φ(·)⟩_H : Rd × Rd → R
Non-linear regression using kernels
(transition: we now turn to a nonparametric approach; we still re-embed the data, but implicitly, through a kernel)

Prop: The kernel of an RKHS on Rd is unique. Conversely, k is the kernel of at most one RKHS on Rd.

▶ Φ(x) = k(x, ·) can be taken as the feature map.

(The underlying intuition is the same as for regularized regression: the minimizer can be chosen in the subspace spanned by the feature vectors Φ(x1), · · · , Φ(xn).)

▶ the regression problem becomes a problem over the coefficients α = [α1, · · · , αn]T:

argmin_α  (1/n) Σ_{i=1}^n L( yi , Σ_{j=1}^n αj k(xj, xi) )  +  λ Σ_{i,j=1}^n αi αj k(xi, xj)

(we have replaced ∥f∥_H by its square, which does not change anything since the regularizer can be chosen freely)

▶ only the values k(xi, xj) are required to carry out the minimization (kernel trick)
Non-linear regression using kernels

With the squared loss, the problem becomes:

argmin_α  Σ_{i=1}^n ( yi − Σ_{j=1}^n αj k(xj, xi) )²  +  λ Σ_{i,j=1}^n αi αj k(xi, xj)

▶ closed-form solution: α̂ = (K + λ In)⁻¹ y, where K = ( k(xi, xj) )_{i,j} is the Gram matrix

▶ estimator: f̂(x) = Σ_{j=1}^n α̂j k(xj, x)
  (this expression can be interpreted as follows: instead of fixing the class of the estimator f̂ a priori, we let the data determine it, as a combination of kernel functions centered at the observations)

[Figure: fits on a 1-d toy dataset: plain linear regression vs. regression with a Gaussian kernel for σ = 0.1, 1, 10; by the left-right symmetry of the data, the line produced by linear regression must be horizontal, while the kernel fit adapts to the data at scale σ.]
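A minimal NumPy sketch of kernel regression with a Gaussian kernel (synthetic 1-d data, illustrative parameter values), following α̂ = (K + λ In)⁻¹ y and f̂(x) = Σ_j α̂_j k(xj, x).

import numpy as np

def gaussian_kernel(a, b, sigma):
    # k(a, b) = exp(-||a - b||² / (2 σ²)), for all pairs of rows of a and b
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-1, 1, size=60))[:, None]
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=60)

sigma, lam = 0.3, 1e-2
K = gaussian_kernel(x, x, sigma)                       # Gram matrix (k(x_i, x_j))
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)   # α̂ = (K + λ I)⁻¹ y

x_new = np.linspace(-1, 1, 5)[:, None]
f_hat = gaussian_kernel(x_new, x, sigma) @ alpha       # f̂(x) = Σ_j α̂_j k(x_j, x)
print(f_hat)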
What you should know
- basis functions
- kernels: definition, Moore and Representer theorems (kernel trick)
Linear Models for Classification
Outline:
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Example: X = {images}, Y = {labels}; classification: Y is discrete.
Statistical framework

Hyp: the observations are drawn from a random variable X, the responses from a random variable Y:
  xi ~ X i.i.d., with values in X = Rd
  yi ~ Y i.i.d., with values in Y = {1, · · · , κ}   (classification)

[Figure: joint distribution of (X, Y) with Y ∈ {1, · · · , κ}. If X and Y are perfectly dependent, ∃ a perfect predictor; if they are imperfectly dependent, ∄ perfect predictor.]

Best prediction at a point x:

argmin_{y ∈ {1,··· ,κ}} Σ_r 1_{y ≠ r} P(Y = r | X = x)  =  argmin_{y ∈ {1,··· ,κ}} 1 − P(Y = y | X = x)  =  argmax_{y ∈ {1,··· ,κ}} P(Y = y | X = x)

(the sum over r is just the expectation of the 0-1 loss, Y being a categorical variable; the first step is where the choice of the 0-1 loss comes into play)

▶ the best prediction at x maximizes the posterior probability P(Y | X = x) (Bayes classifier)
Practical remarks on nearest-neighbor prediction:
• implementation: only a few lines of code for NN-search via linear scan; the same applies for more advanced sublinear methods, using the right libraries (e.g. ANN, LSH, ...)
• no hyper-parameter
• algorithmic cost of prediction:
  ▶ linear scan in Θ(nd)
  ▶ sublinear methods become (close to) linear in high dimensions
• slow convergence in high dimensions (curse of dimensionality)

By contrast, linear classifiers offer easy and efficient prediction (a dot product), with the algorithmic cost put on pre-training.
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
[Figure: colored dots are the input observations with their responses; the colored areas are the fibers f⁻¹({1}), f⁻¹({2}) of a linear (left) vs. a nonlinear (right) classifier.]

Linear methods for classification

The response variable Y is discrete:
▶ consider the fibers of the predictor f: f⁻¹({1}), · · · , f⁻¹({κ})
▶ a linear classifier produces linear decision boundaries
  (as in the lecture on linear regression methods, "linear" refers to the form of the model in its coefficients)

2 types of approaches:
▶ model the posterior probability P(Y = y | X = x) via a discriminant function δy for each class y, then classify by taking argmax_{y∈{1,··· ,κ}} δy(x)   (this refers back to the Bayes classifier)
  e.g. linear / logistic regression, LDA
▶ model the decision boundaries directly
  e.g. SVM, the perceptron (seen later)

[Figure: LDA on the Iris dataset; left: the dataset with the decision boundaries between f⁻¹({1}), f⁻¹({2}), f⁻¹({3}); right: the discriminant functions δ1, δ2, δ3.]
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Linear regression for classification

Fit the model by least squares: B̂ := argmin_B Σ_{i=1}^n ∥ Z(yi) − [1 xiT] B ∥²₂
(Z(y) denotes the vector indicator encoding of class y)

On a 3-class example: error rate ≈ 33%, which is surprising since the classes are indeed linearly separable (compare with the Bayes error rate).

[Figure: cross-section of the fitted discriminant functions along the diagonal of the data; the plot explains the phenomenon: the discriminant function δ̂2 of the middle class never attains the maximum, so that class is masked by δ̂1 and δ̂3.]

What is happening:
▶ the discriminant functions sum up to 1 (in the centered model, i.e. assuming that the set of observations is centered in Rd, mean = 0)
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Logistic regression for binary classification

Generalized linear model for the discriminant functions:

▶ δ1(x) := σ ([1 xT] β1),  where σ(t) := exp(t) / (1 + exp(t)) = 1 / (1 + exp(−t))   (logistic sigmoid function)
  (the sigmoid forces δ1(x) ∈ [0, 1])
▶ δ2(x) := 1 − δ1(x) = σ (− [1 xT] β1)

Properties of the logistic sigmoid ("sigmoid" means "S-shaped" and bounded; the logistic sigmoid is but one example):
▶ σ⁻¹(u) = ln ( u / (1 − u) )   (logit function)
▶ σ(t) + σ(−t) = 1

Properties of the regression and of the associated classifier (once again, f̂(x) is defined as the argmax of the discriminant functions):
▶ P(Y = 1 | X = x) = δ1(x) = σ ([1 xT] β1)
▶ the model makes the posterior probability ratio log-linear:
  ln [ P(Y = 1 | X = x) / P(Y = 2 | X = x) ] = ln [ σ([1 xT] β1) / σ(−[1 xT] β1) ] = ln [ (1 + exp([1 xT] β1)) / (1 + exp(−[1 xT] β1)) ] = [1 xT] β1
  (factorize the numerator by exp([1 xT] β1); fundamentally, we fit only one of the discriminant functions, the other one being determined)

Model fitting by maximum likelihood (we have already seen maximum likelihood estimation for parametrized models in the lecture on density estimation):

L( (yi)_{i=1..n} ; (xi)_{i=1..n}, β1 ) = Π_{i=1}^n P(Y = yi | X = xi ; β1)   (independence)

log L( (yi)_{i=1..n} ; (xi)_{i=1..n}, β1 ) = Σ_{i=1}^n log P(Y = yi | X = xi ; β1)

Change of variable: Z := 1_{Y=1} ∈ {0, 1} and, ∀i, zi := 1_{yi=1} ∈ {0, 1}
(this is just to have a response taking values in {0, 1} instead of {1, 2}, which is more convenient)

Denoting by ℓ(β1) the log-likelihood:

∇ℓ(β1) = Σ_{i=1}^n ( zi − σ([1 xiT] β1) ) [1 xiT]T   (gradient vector at β1)

∇²ℓ(β1) = − Σ_{i=1}^n σ([1 xiT] β1) ( 1 − σ([1 xiT] β1) ) [1 xiT]T [1 xiT]   (Hessian matrix at β1; here we use the fact that σ′ = σ(1 − σ))

The Hessian is negative semi-definite (it is the negated sum of positive semi-definite matrices with non-negative weights) ⇒ ℓ(β1) is a concave function, so only the global maxima annihilate the gradient.

▶ choose β̂1 arbitrarily in the solution set of ∇ℓ(β1) = 0   (d + 1 non-linear equations in β1)

Newton-Raphson's method (basically gradient ascent, with the step prescribed by the Hessian matrix):

repeat:
  β̂1 ←− β̂1 − ( ∇²ℓ(β̂1) )⁻¹ ∇ℓ(β̂1)   // assuming a non-singular Hessian

Thm: if β̄1 is a maximum s.t. ∇²ℓ(β̄1) is non-singular, then β̄1 is the unique maximum (a non-singular Hessian at a maximum must be negative definite, hence ℓ is strictly concave), and for an initial β̂1 close enough to β̄1 the convergence to β̄1 is quadratic (close enough to β̄1 the Hessian remains non-singular, so the algorithm proceeds).
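A minimal NumPy sketch of Newton-Raphson for binary logistic regression, using the gradient and Hessian given above (zi ∈ {0, 1}); the data are synthetic and the names illustrative.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
n, d = 200, 2
X = np.c_[np.ones(n), rng.normal(size=(n, d))]          # rows [1 x_i^T]
beta_true = np.array([0.5, 2.0, -1.0])
z = (rng.uniform(size=n) < sigmoid(X @ beta_true)).astype(float)

beta = np.zeros(d + 1)
for _ in range(25):
    p = sigmoid(X @ beta)
    grad = X.T @ (z - p)                                 # ∇ℓ(β)
    hess = -(X.T * (p * (1 - p))) @ X                    # ∇²ℓ(β)
    step = np.linalg.solve(hess, grad)
    beta = beta - step                                   # β ← β − (∇²ℓ)⁻¹ ∇ℓ
    if np.linalg.norm(step) < 1e-8:
        break
print(beta)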
Logistic regression for binary classification

Degenerate cases (singular Hessian): when the data are degenerate, applying the vanilla logistic regression may lead to numerical issues (the maximum likelihood estimate may not even exist) ⇒ regularize the log-likelihood.

▶ case p = 2 (Tikhonov; in the lecture on regression we called it "ridge" because it led to ridge linear regression). The regularized gradient and Hessian become:

∇ℓ(β1) = Σ_{i=1}^n ( zi − σ([1 xiT] β1) ) [1 xiT]T − 2λ β1

∇²ℓ(β1) = − Σ_{i=1}^n σ([1 xiT] β1) ( 1 − σ([1 xiT] β1) ) [1 xiT]T [1 xiT] − 2λ I_{d+1}
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Multi-class logistic regression

Log-linear model for the posterior probability ratios (we choose a reference class, say κ, and regress the other classes against it):

ln [ P(Y = 1 | X = x) / P(Y = κ | X = x) ] = [1 xT] β1
  ...
ln [ P(Y = κ − 1 | X = x) / P(Y = κ | X = x) ] = [1 xT] βκ−1

parameter matrix B := [ β1 ··· βκ−1 ] ∈ R^{(d+1)×(κ−1)}

Equivalently, the discriminant functions are given by the generalized sigmoid (softmax):

δy(x) := P(Y = y | X = x) = exp([1 xT] βy) / ( 1 + Σ_{z<κ} exp([1 xT] βz) )   for y = 1, · · · , κ − 1

δκ(x) := P(Y = κ | X = x) = 1 / ( 1 + Σ_{z<κ} exp([1 xT] βz) ) = 1 − Σ_{y<κ} δy(x)

(each δy(x) ∈ [0, 1], and the softmax again forces the discriminant functions to sum up to 1; although it is not as direct as in the binary case, only κ − 1 parameter vectors are fitted)

▶ estimate B by maximum likelihood and Newton-Raphson's algorithm
  (the expressions for the objective function and for the iteration steps are more involved than in the binary case)
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Support Vector Machines (SVM)

Principle: explicitly construct the 'best' hyperplanes separating the various classes:
▶ the hyperplanes that maximize the margins (closest distances to the data points)

Hyperplane equation: xT β − β0 = 0
▶ parameters: β ∈ Rd \ {0}, β0 ∈ R
▶ β0 / ∥β∥ is the shift from the origin along β
▶ fix 1/∥β∥ to be the margin
  (there is one degree of freedom in the hyperplane equation, since the solution set is invariant under rescaling of (β, β0); the convention that 1/∥β∥ is the margin removes it)
⇒ as a consequence of this convention, the slab boundaries have equations xT β − β0 = ±1

[Figure: separating hyperplane xT β − β0 = 0 with the two slab boundaries xT β − β0 = ±1 at distance 1/∥β∥ on either side, and the closest data points (xi, ±1) on the boundaries.]

Optimization problem (maximize the margin while leaving the data points outside the slab, on the correct side):

β̂, β̂0 := argmin_{β, β0} ∥β∥²   subject to:
  xiT β − β0 ≥ 1    ∀i s.t. yi = 1
  xiT β − β0 ≤ −1   ∀i s.t. yi = −1

(the two families of constraints are equivalent to the single family yi (xiT β − β0) ≥ 1, ∀i = 1, · · · , n)

▶ quadratic programming problem: quadratic objective (with a positive definite quadratic form in β) and linear equality or inequality constraints
▶ classifier: f̂(x) = sign ( xT β̂ − β̂0 )
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
When classes are not linearly separable

Hinge loss: max{ 0, 1 − yi (xiT β − β0) }
(the hinge loss is zero for observations lying on the correct side of the slab, i.e. satisfying the previous constraints)

β̂, β̂0 := argmin_{β, β0}  (1/n) Σ_{i=1}^n max{ 0, 1 − yi (xiT β − β0) } + λ ∥β∥²,   with λ > 0 a parameter

(minimize the mean loss, i.e. try to satisfy the constraints as best as possible; this loss term competes with the margin term λ∥β∥²)

▶ when the classes are linearly separable, the problem with hard constraints is recovered by taking λ > 0 small enough (the margin term then becomes negligible compared to any constraint violation)
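A minimal NumPy sketch of sub-gradient descent on the soft-margin objective (1/n) Σ max{0, 1 − yi(xiTβ − β0)} + λ∥β∥² stated above; this is an illustrative solver on synthetic data, not an off-the-shelf SVM implementation.

import numpy as np

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)

lam, lr, epochs = 1e-2, 0.1, 500
beta, beta0 = np.zeros(2), 0.0
for t in range(epochs):
    margins = y * (X @ beta - beta0)
    active = margins < 1                       # points with non-zero hinge loss
    g_beta = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * beta
    g_beta0 = y[active].sum() / n              # sub-gradient of the mean hinge loss w.r.t. β0
    beta -= lr * g_beta
    beta0 -= lr * g_beta0
print(beta, beta0)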
When classes are not linearly separable

In such cases, the previous problem with hard constraints has no solution ⇒ we must relax it.

▶ slack variables: ξi := max{ 0, 1 − yi (xiT β − β0) }   (ξi measures the hinge loss on the i-th constraint)

▶ substitution:

  β̂, β̂0, (ξ̂i) := argmin_{β, β0, (ξi)}  (1/n) Σ_{i=1}^n ξi + λ ∥β∥²
  subject to:  ∀i, ξi ≥ 0  and  yi (xiT β − β0) ≥ 1 − ξi

  (this is a quadratic program in the unknowns β, β0, (ξi), with a positive definite quadratic term in β; the definition of ξi is relaxed so as to remove the max from the constraints: we only ask ξi to be an upper bound on the hinge loss; at the optimum one of the two inequalities turns into an equality, i.e. ∀i, ξ̂i = max{0, 1 − yi (xiT β̂ − β̂0)}, because each ξi can be optimized independently from the others once β, β0 have been fixed)

Support vectors:

▶ ξ̂i = 0:
  • yi (xiT β̂ − β̂0) = 1: xi lies on a slab boundary
  • yi (xiT β̂ − β̂0) > 1: xi lies on the correct side of the slab
▶ ξ̂i > 0: then ξ̂i = 1 − yi (xiT β̂ − β̂0) > 0, and xi lies on the wrong side of its slab boundary

The support vectors are the observations lying on a slab boundary or on the wrong side of it; these are the vectors that count to define the slab.

[Figure: soft-margin SVM with the slab boundaries xT β − β0 = ±1 around the separating hyperplane; the support vectors lie on or beyond them.]
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Multi-class SVM

Principle: convert the multi-class problem into multiple binary problems (SVM is essentially tied to binary classification).

▶ One-vs-all: train one classifier (β̂y, β̂0y) per class y, to discriminate y from the rest
  • assign each new observation x ∈ Rd to the class whose classifier gives the highest score:
    argmax_{y = 1, · · · , κ}  xT β̂y − β̂0y
  (the class with the highest score wins the bet; this requires the scores of the κ classifiers to be comparable)

▶ One-vs-one: train one classifier (β̂^{y,y′}, β̂0^{y,y′}) for each pair of classes y ≠ y′ ∈ {1, · · · , κ}, to discriminate y from y′ in their joint subpopulations
  (there are (κ choose 2) binary classifiers, i.e. a quadratic number in the number of classes; each is trained on the subpopulation of observations with labels y or y′, and prediction is typically made by majority vote among the pairwise classifiers)
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Kernel SVM

[Figure: a two-class dataset that is not linearly separable in the data space, where a linear SVM performs poorly; after the embedding Φ : x ↦ [x, x²] the classes become linearly separable, and a linear SVM in the feature space separates them.]

Quadratic program (hard margin / no slack; the soft-margin case is similar — note that the feature space may be infinite-dimensional):

Representer Thm ⇒ β̂ = Σ_{i=1}^n αi yi Φ(xi) = Σ_{i=1}^n αi yi k(xi, ·)

We merely substitute the linear combination Σ_j αj yj Φ(xj) for β, and thus get a new quadratic program:

argmin_{α, β0}  Σ_{i=1}^n Σ_{j=1}^n αi yi k(xi, xj) yj αj    subj. to    yi ( Σ_{j=1}^n αj yj k(xi, xj) − β0 ) ≥ 1   ∀i

[Figure: the corresponding decision boundary drawn back in the data space.]
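A short sketch using scikit-learn's SVC (assumed available) to fit kernel SVMs with a degree-4 polynomial kernel and a Gaussian (RBF) kernel; the dataset below is a synthetic stand-in, not the booklet's Gaussian-mixture example.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)    # circular classes: not linearly separable

svm_poly = SVC(kernel="poly", degree=4, C=1.0).fit(X, y)
svm_rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)   # gamma plays the role of 1/(2σ²)

print("poly-4 training accuracy:", svm_poly.score(X, y))
print("rbf    training accuracy:", svm_rbf.score(X, y))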
Experimental results

n = 100 + 100 (mixture of 2 Gaussians), d = 2

[Figure: decision boundaries on the test data; the dashed lines are the slab's boundaries, as before, and the decision boundary of the Bayes classifier is shown for reference.]

• SVM with deg.-4 polynomial kernel: error rate ≈ 24.5%
• SVM with Gaussian kernel: error rate ≈ 21.8%
  (the performance of the Gaussian kernel is particularly good here)

• logistic regression with deg.-4 polynomial kernel: error rate ≈ 26.3%
• logistic regression with Gaussian kernel: error rate ≈ 22.1%
  (logistic regression in feature space is regularized; λ and the window size of the kernel are hyper-parameters to be set; again, the performance of the Gaussian kernel is particularly good)
What you should know
• Two types of linear approaches for classification
• Logistic regression:
- generalized linear model
- fitting by likelihood maximization & Newton-Raphson’s method,
convergence guarantees
- degenerate cases & regularization
- extension to multi-class
Outline:
[Figure: a single neuron with inputs x1, · · · , xd and output y.]
Historical landmarks

[Figure series: historical landmarks of neural networks — 1969, 1985, 1986, 1989, 1998, 2006, and the recent deep-learning era.]

'Deep learning conspiracy': the expression that the protagonists themselves employ to designate their strategy to put neural networks back on the research agenda.

Deep learning and big data (we go slightly back in time, from 2012 to 2009, for the sake of theme consistency).

Rosenblatt's perceptron algorithm

▶ designed for binary classification (Y = {−1, 1})

▶ classifier: y(x) := sign ( [1 xT] [−β0 ; β] ) = sign ( xT β − β0 )
  (the associated classifier is the indicator function of the positive half-space bounded by the hyperplane)

▶ models the separating hyperplane directly: [1 xT] [−β0 ; β] = 0
  (it looks like SVM, but the objective function is different)

▶ fits the model by minimizing the distances of the misclassified points to the decision boundary:

  β̂, β̂0 := argmin_{β, β0}  Σ_{xi misclassified at β, β0}  − yi (xiT β − β0)

  (the distances here are absolute, not signed, hence the minus sign: for a misclassified point xi, yi and xiT β − β0 have opposite signs; note that xiT β − β0 = ∥β∥ × the signed distance to the hyperplane, and ∥β∥ is a free parameter, since the hyperplane is unchanged when β and β0 are both multiplied by the same positive constant)

▶ classifier in practice: ŷ(x) := sign ( [1 xT] [−β̂0 ; β̂] ), using the estimated parameters β̂, β̂0

▶ piecewise linear functional optimization (a special case of piecewise smooth optimization: standing at a fixed (β, β0), the set of misclassified points is fixed, so the functional is linear there), with partial gradients

  ∂/∂β  = − Σ_{xi misclassified at β, β0} yi xi
  ∂/∂β0 =   Σ_{xi misclassified at β, β0} yi
Rosenblatt's perceptron algorithm (batch gradient descent):

init: set [ β̂0 ; β̂ ] at random
repeat:
  compute the misclassified set M at (β̂, β̂0)
  [ β̂0 ; β̂ ] ←− [ β̂0 ; β̂ ] + ϱ Σ_{xi ∈ M} [ −yi ; yi xi ]
until convergence   // requires a convergence threshold

(ϱ ∈ [0, 1] is the step size, or learning rate, of the algorithm; a threshold is not needed if a separating hyperplane actually exists — see below)

Stochastic version (also called "iterative"):

init: set [ β̂0 ; β̂ ] at random
repeat:
  foreach i = 1, · · · , n do
    if yi (xiT β̂ − β̂0) < 0 then [ β̂0 ; β̂ ] ←− [ β̂0 ; β̂ ] + ϱ [ −yi ; yi xi ]   // β̂, β̂0 (and thus M) are updated after each misclassified observation

(this is stochastic gradient descent: the difference with classical gradient descent is that a step is taken after each misclassified observation, the input observations and responses being processed in sequence)

Thm [Rosenblatt 1960] [Novikoff 1962] (the perceptron itself was proposed by Rosenblatt in 1957):
• If the two classes are linearly separable, then stochastic gradient descent with ϱ = 1 makes the energy converge to 0 in finitely many steps (convergence of the energy to 0 means that some separating hyperplane is found).
• More precisely, if ∃ a separating hyperplane with margin γ and if ∥xi∥ ≤ R ∀i = 1, · · · , n, then convergence occurs after O(R²/γ²) steps (the smaller the optimal margin, the longer it takes for the algorithm to converge).
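A minimal NumPy sketch of the stochastic perceptron updates above (labels in {−1, +1}); the data are synthetic and linearly separable by construction, and the names are illustrative.

import numpy as np

rng = np.random.default_rng(6)
n = 100
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] - 2 * X[:, 1] > 0, 1.0, -1.0)    # separable by construction

beta, beta0, rho = np.zeros(2), 0.0, 1.0
for epoch in range(100):
    mistakes = 0
    for i in range(n):
        if y[i] * (X[i] @ beta - beta0) <= 0:          # misclassified (≤ handles the zero init)
            beta += rho * y[i] * X[i]                  # β  ← β  + ϱ y_i x_i
            beta0 -= rho * y[i]                        # β0 ← β0 − ϱ y_i
            mistakes += 1
    if mistakes == 0:                                  # all points correctly classified
        break
print(beta, beta0, "epochs used:", epoch + 1)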
▶ the linear structure of the energy (a sum of terms, each depending on a single observation) permits the use of stochastic gradient descent
▶ stochastic gradient descent scales up well and allows for online learning: input observations and responses are processed in sequence, with no need to restart the training when new data arrive

Drawbacks:
▶ no unique solution ⇝ the solution depends on the initialization
  (this is in contrast to SVM, which provides separators maximizing the margin)
▶ convergence to irrelevant configurations when the classes are not linearly separable
  (again in contrast to SVM, which provides separators with best margin even in this case)

Despite these drawbacks, the approach remains highly appealing thanks to stochastic gradient descent.
Connectionist viewpoint on the perceptron

y = sign ( xT β − β0 )

[Figure: a single neuron with inputs x1, · · · , xd, weights β1, · · · , βd, bias β0 ∈ R, and a linear activation followed by the sign; β is the vector obtained by collating the weights βi.]

Replacing the activation function:
▶ sigmoid: t ↦ 1 / (1 + exp(−t)) ⇝ the neuron implements logistic regression
▶ softmax_i: (x1, · · · , xd) ↦ exp(xi) / Σj exp(xj) ⇝ produces outputs in [0, 1] that sum up to 1
Multi-layer perceptron

Feedforward, full connectivity between consecutive layers
(feedforward: the signal moves from left to right; arrow heads are omitted in the pictures)

[Figure: network with an input layer (x1, · · · , xd), s hidden layers with ri neurons per layer i (activations σ^{i,j}, biases β0^{i,j}), and an output layer (y1, · · · , yκ), possibly with softmax output units.]

Notes:
▶ x1, · · · , xd are the coordinates of a single input observation x ∈ Rd (not a collection of d observations)
▶ neurons in the input layer are mere identity functions: all they do is forward their (unique) input
▶ "hidden" means that the neurons are connected neither to the input nor to the output variables
▶ the softmax output units ensure that the outputs y1, · · · , yκ lie in [0, 1] and sum up to 1
Approximation power

The 1-hidden-layer case (with logistic sigmoid activations):

y(x) = Σ_{j=1}^r γj / ( 1 + exp ( β0j − xT βj ) )

[Figure: network with one hidden layer of r sigmoid neurons, weights βj, biases β0j, and output weights γj.]

Thm (Universal Approximation) [Cybenko 1989]:
For any continuous function f : Rd → R with compact support X, and any ε > 0, there exist a hidden layer size r ∈ N and parameter values β0j, βj, γj for 1 ≤ j ≤ r such that
  |f(x) − y(x)| ≤ ε   ∀x ∈ X.

(in words, the functions y(x) produced by 1-hidden-layer networks can approximate any continuous function uniformly on compact sets)
Outline
Training

Input: (x1, y1), · · · , (xn, yn)

1-hidden-layer network with general activation functions σ1, · · · , σr:

y(x) = Σ_{j=1}^r γj σj ( xT βj − β0j ) = γT [ σ1(xT β1 − β01), · · · , σr(xT βr − β0r) ]T

(we express the sum as a dot product between the output weights γ and the vector of hidden activations)

Objective: RSS = Σ_{i=1}^n ( yi − y(xi) )²,  with Ri := ( yi − y(xi) )² the contribution of observation i.

Gradients of Ri:

∇γ Ri = −2 ( yi − y(xi) ) [ σ1(xiT β1 − β01), · · · , σr(xiT βr − β0r) ]T

∇βj Ri = −2 ( yi − y(xi) ) γj σj′ ( xiT βj − β0j ) xi

(the gradient's formula has the same form at each neuron: ∇ = error · impulse, where the impulse comes from the previous layer's neurons and the error is at the current neuron; the output neuron can be seen as the next layer of the hidden neurons)

Reading the gradients as back-propagation:
▶ ∇γ Ri: error at the output neuron × impulses from the hidden layer (the output activation is the identity, so id′ = 1 and its outgoing weight is 1)
▶ ∇βj Ri: the error at hidden neuron j is obtained by back-propagating the error from the next layer, weighted by γj (the weight of the j-th hidden neuron in the output neuron) and by the derivative σj′, and then multiplied by the impulse xi (up to a sign convention)
Training: back-propagation

Gradient at each neuron: ∇β Ri = err · z
(z: vector of impulses coming into the neuron; err: error at the neuron)

Back-propagation equation (general feed-forward case):

err = ( Σ_{j=1}^s γj · errj ) σ′ ( zT β − β0 )

[Figure: a neuron with incoming impulses z1, · · · , zr (weights β1, · · · , βr, bias β0, activation σ), connected to s neurons of the next layer with weights γ1, · · · , γs and errors err1, · · · , errs.]

Forward-backward procedure for each (xi, yi):
▶ forward: compute activations & impulses
▶ backward: back-propagate the error and update
    β  ←− β  − ϱ ∇β Ri
    β0 ←− β0 − ϱ ∇β0 Ri
  (ϱ is the learning rate; the equation for ∇β0 Ri is obtained similarly, with the constant input −1 playing the role of the impulse)

Training epoch (stochastic gradient pass):
▶ sweep through the training set
  (the "stochastic" term comes from the fact that the order in which the training set is swept is randomized)

Online learning:
▶ perform multiple training epochs
▶ scales up well
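A minimal NumPy sketch of the forward-backward procedure for a 1-hidden-layer network with sigmoid hidden units and a linear output, trained one observation at a time (stochastic updates); the data and hyper-parameters are illustrative.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0])

r, lr = 10, 0.05
B = 0.1 * rng.normal(size=(1, r))       # hidden weights β_j (one column per hidden unit)
b0 = np.zeros(r)                        # hidden biases β0_j
gamma = 0.1 * rng.normal(size=r)        # output weights γ_j

for epoch in range(200):
    for i in rng.permutation(len(X)):              # random sweep order (stochastic epoch)
        # forward pass: activations and impulses
        a = X[i] @ B - b0                          # pre-activations x^T β_j − β0_j
        h = sigmoid(a)                             # hidden impulses
        out = gamma @ h                            # network output y(x_i)
        # backward pass: back-propagate the error of R_i = (y_i − y(x_i))²
        err_out = -2 * (y[i] - out)                # dR_i / d(out)
        err_hidden = err_out * gamma * h * (1 - h) # back-propagated errors, σ' = σ(1 − σ)
        # gradient = error · impulse at each neuron
        gamma -= lr * err_out * h
        B -= lr * np.outer(X[i], err_hidden)
        b0 -= lr * (-err_hidden)                   # bias enters with a minus sign in x^Tβ − β0

mse = np.mean((y - sigmoid(X @ B - b0) @ gamma) ** 2)
print("training MSE:", mse)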
Regularization

The high number of parameters in neural networks usually leads to overfitting.

[Figure: training and test error as a function of the number of training epochs.]

Approach 2: dropout:
▶ at each training epoch, randomly switch off a fraction of the neurons in each layer
  (the neurons are completely switched off: their activations and gradients are set to zero)
▶ this replaces the full model by a series of random simplified models
▶ select the fraction of switched-off neurons by cross-validation
From MLP to convolutional networks

▶ dataset down-sized to 320 images for training and 160 for testing

Proposed networks [Le Cun 1989]:
▶ Net-1: no hidden layer, full connectivity with sigmoid output units

[Figure: architectures of the proposed networks, with layer sizes such as 16×16, 8×8, 4×4, 1×12, 1×10.]

Accuracies (reminder: accuracy = success rate):
  Net-2: 87%
  Net-3: 88.5%
  Net-4: 94%
  Net-5: 98.4%

[Figure: accuracy as a function of the number of training epochs.]
What you should know
Outline:
• Vector quantization

[Figure: pipeline Data → Features ∈ Rn, via feature design or feature learning.]

Input: data space D (can be Rd, a space of graphs, of 3d shapes, etc.)
(in previous lectures we talked about kernels for data sitting in Rd, e.g. the Gaussian kernel)

2 classes of approaches:
▶ area-specific
▶ in this overview: 1-2 approaches per data type (cf. specialized 3A courses)
(the overview is intended to help the students get started for their upcoming projects or internships)
Outline
• Vector quantization
Text features

Bag-of-words model (represent a text by the counts of its words):
▶ remove stop-words (they unbalance the distribution)

Example:
"Humans come down from the apes, the ape comes down from the tree."
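A minimal Python sketch of the bag-of-words representation applied to the sentence above; the tokenization and the tiny stop-word list are illustrative choices only.

from collections import Counter
import re

STOP_WORDS = {"the", "from"}      # illustrative stop-word list

def bag_of_words(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

sentence = "Humans come down from the apes, the ape comes down from the tree."
print(bag_of_words(sentence))
# Counter({'down': 2, 'humans': 1, 'come': 1, 'apes': 1, 'ape': 1, 'comes': 1, 'tree': 1})
# (stemming or lemmatization would additionally merge e.g. 'ape'/'apes' and 'come'/'comes')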
Complement: word2vec, a neural net trained to predict a word from its context, the context being the neighboring words in the input text (word2vec in fact uses a "continuous bag-of-words" model, among other variants).
Outline
• Vector quantization
Graph features

Adjacency matrix (undirected: A = AT)

Example (n = 5):

      0 1 1 0 0
      1 0 1 1 0
A =   1 1 0 0 0
      0 1 0 0 1
      0 0 0 1 0

Graphlets:
▶ count the number of occurrences of each graphlet of size k as an induced subgraph
  (the graphlet must appear as the subgraph induced by its vertices, not merely as a subgraph)

Example (n = 5, k = 3), same graph as above:
X = (1, 3, 6, 0)
(counts of triangles, paths, single edges, and empty triples among the induced 3-vertex subgraphs)
Normalized version (divide by the number of k-subsets, here C(5,3) = 10):
X′ = ( 1/10, 3/10, 6/10, 0 )

Props:
▶ G1 ≃ G2 ⇒ X′(G1) = X′(G2)
▶ the converse holds for n = k + 1 ≤ 11, but not in general
  (so, in general, we do lose some information in the process)
Computation:
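A minimal Python sketch counting the 3-vertex graphlets as induced subgraphs by brute force over all vertex triples; on the 5-vertex example above it reproduces X = (1, 3, 6, 0) and X′ = (0.1, 0.3, 0.6, 0).

from itertools import combinations
import numpy as np

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 0, 1, 0]])

counts = [0, 0, 0, 0]                      # triangles, paths, single edges, empty triples
for i, j, k in combinations(range(len(A)), 3):
    edges = A[i, j] + A[i, k] + A[j, k]    # number of edges induced by {i, j, k}
    counts[3 - edges] += 1

X = tuple(counts)
X_norm = tuple(c / sum(counts) for c in counts)
print(X)        # (1, 3, 6, 0)
print(X_norm)   # (0.1, 0.3, 0.6, 0.0)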
• Vector quantization
Image features

Input: images via intensity maps I : Z² → R, one for each color channel

SIFT descriptor at a point x:

(a) choose the scale σ (this step makes the feature scale-invariant, as the name says):
▶ compute convolutions at x with Gaussian kernels of various bandwidths σi (for instance bandwidths ranging over a logarithmic scale)
▶ compute the differences between the convolutions at consecutive bandwidths σi, σi+1
  (these differences approximate the gradient of the convolution w.r.t. the scale parameter σ; each difference is integrated over the mask's domain)
▶ select the scale(s) σ with maximum difference

[Figure: Gaussian masks of increasing bandwidth centered at x.]

(b) choose the orientation (this step makes the feature rotation-invariant, a desirable property as well):
▶ compute the intensity gradient at each pixel y in a window of size ∝ σ around x
▶ build the histogram of gradient directions (36 bins, 10 degrees each)
▶ assign the orientation corresponding to the highest peak in the histogram
▶ rotate the image so that the assigned orientation is vertical

(c) compute the feature:
▶ subdivide a 16 × 16 window around x into 16 patches of size 4 × 4
  (the size of the window is fixed a priori, i.e. independent of the scale σ)
▶ compute the histogram of gradient orientations (8 bins) in each 4 × 4 patch
▶ collect the 8 × 4 × 4 = 128 values (weighted by a Gaussian centered at x) into a vector
• Vector quantization
3d shape features

▶ non-canonical representation
[Figure: a 3d shape X with the quantities h, θ, r indicated.]
• Vector quantization
Time series features

Issues with raw time series:
▶ may be chaotic, irregularly sampled, multivariate, hard to realign, etc.
  (chaotic behavior in this context can come, e.g., from incommensurate frequencies)
(classical features have been proposed, notably for periodic time series: coefficients of Fourier or wavelet transforms)

Time-delay embedding (a.k.a. sliding-window embedding):

TDm,τ (f, t) := [ f(t), f(t + τ), · · · , f(t + mτ) ] ∈ R^{m+1}

  τ : step / delay
  mτ : window size
  m + 1 : embedding dimension

(the formulas given here assume d = 1; for higher values of d, the vectors obtained for each coordinate are concatenated)

▶ the time series becomes a regular point cloud in R^{m+1} (time is forgotten about): each point is the window of m + 1 values starting at some time t

Properties:
  periodicity ⟷ circularity of the point cloud
  max. frequency (ν) ⟷ min. ambient dimension (m ≥ 2ν)
(this comes, remotely, from Shannon's sampling theorem: the larger the frequency, the more samples per period are needed)
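A minimal NumPy sketch of the time-delay (sliding-window) embedding TD_{m,τ} for a 1-d series sampled at integer indices; τ and m below are illustrative choices.

import numpy as np

def time_delay_embedding(f, m, tau):
    """Return the point cloud { (f(t), f(t+τ), ..., f(t+mτ)) } ⊂ R^{m+1}."""
    n_points = len(f) - m * tau
    return np.stack([f[t : t + m * tau + 1 : tau] for t in range(n_points)])

t = np.linspace(0, 10, 500)
f = np.sin(2 * np.pi * t)                    # a periodic signal
cloud = time_delay_embedding(f, m=2, tau=25)
print(cloud.shape)                           # (450, 3): a point cloud in R^3
# for a periodic signal, the embedded point cloud lies on a closed loop (circularity)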
• Vector quantization
Curse of dimensionality

Dimensionality reduction: map the data from Rd down to Rk, with k ≪ d.

Example: a set of 4096-dimensional data points, representing pixel images of a same object under various lighting angles (from Isomap, Science 2000).

A wealth of approaches (this is a very old and rich topic; only a few popular approaches are listed here):
▶ linear: PCA, MDS, NMF (non-negative matrix factorization)
• Vector quantization
Vector quantization

▶ k-means: word functions dj(xi) = 1_{xi ∈ Xj}   (indicator of the j-th cluster / codeword)

Applications in feature extraction (each datum x is one feature extracted from one observation):
▶ reduce the dimensionality of the feature space: a single feature extracted from an observation can be mapped to the vector of word functions evaluated at it
▶ encode a set of features as a distribution over the codebook (bag-of-features): when several features are extracted from the same observation, pool the word functions dj(·) over them

Example: SIFT features + k-means quantization + pooling
(the word functions take their values in {0, 1}, so sum-pooling simply counts the number of occurrences of each codeword)

[Figure: histogram of # occurrences over the codewords.]
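A minimal sketch of vector quantization for feature pooling with scikit-learn's KMeans (assumed available): learn a codebook from a set of local descriptors, then encode each image by the histogram of codeword occurrences (bag-of-features). The descriptors below are random stand-ins for SIFT vectors.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
descriptors_per_image = [rng.normal(size=(rng.integers(20, 40), 128)) for _ in range(5)]

codebook = KMeans(n_clusters=16, n_init=10, random_state=0)
codebook.fit(np.vstack(descriptors_per_image))           # learn the codewords

def bag_of_features(descriptors, codebook):
    words = codebook.predict(descriptors)                # nearest codeword for each descriptor
    return np.bincount(words, minlength=codebook.n_clusters)   # sum-pooling = counts

histograms = np.array([bag_of_features(d, codebook) for d in descriptors_per_image])
print(histograms.shape)    # (5, 16): one 16-bin histogram per image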
Outline
• Vector quantization
Principal component analysis (PCA)

Hypothesis: the data lie on (or close to) some k-dimensional affine subspace H.

Input: X = [ x1T ; · · · ; xnT ] ∈ R^{n×d},  k ∈ N

Prop: the optimal subspace H contains the centroid x̄ := (1/n) Σ_{i=1}^n xi

proof: let E be any affine subspace and πE the orthogonal projection onto E. By Pythagoras' theorem,

(1/n) Σ_{i=1}^n ∥xi − πE(xi)∥²₂ = (1/n) Σ_{i=1}^n [ ∥xi − πE(x̄)∥²₂ − ∥πE(x̄) − πE(xi)∥²₂ ]

Developing the square and noticing that the cross-term (dot product) vanishes because Σi (xi − x̄) = 0:

(1/n) Σ_{i=1}^n ∥xi − πE(x̄)∥²₂ = (1/n) Σ_{i=1}^n ∥xi − x̄∥²₂ + ∥x̄ − πE(x̄)∥²₂

The first term and the subtracted term above are invariant under translations of E, while ∥x̄ − πE(x̄)∥² is minimized (equal to 0) exactly when x̄ ∈ E; hence translating E so that it contains x̄ can only decrease the objective. □

Resolution as a least-squares problem (assume the data centered at x̄):

B̂ := argmin_B ∥X B∥²F,  where B = [ β1 ··· βd−k 0 ··· 0 ] ∈ R^{d×d} s.t. BT B = [ I_{d−k} 0 ; 0 0 ]
(the non-zero columns of B form an orthonormal basis of the orthogonal complement of the direction of H)

▶ B can be expressed as W [ I_{d−k} 0 ; 0 0 ], where W is a full-rank orthogonal matrix (WT W = Id) right-composed with an orthogonal projection

⇒ ∥X B∥²F = ∥ U D V W [ I_{d−k} 0 ; 0 0 ] ∥²F,  writing the singular value decomposition of X as U D V (with the singular values sorted in increasing order along the diagonal of D)

This quantity is minimized when W = VT: this choice places the lowest singular values in the block retained by the projection.

⇒ B̂ = VT [ I_{d−k} 0 ; 0 0 ],  H = ker B̂T,  and  πH(x1, · · · , xn) = X VT [ 0 0 ; 0 Ik ]
(the latter is the matrix of the coordinates of the projected observations)

Geometric interpretation:
▶ VT aligns the frame with the principal directions of the covariance matrix
▶ [ 0 0 ; 0 Ik ] projects onto the principal directions with the largest eigenvalues (variances)
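A minimal NumPy sketch of PCA via the SVD on synthetic data; note that np.linalg.svd returns the singular values in decreasing order, so the top k right singular vectors (first k rows of Vᵀ) are kept, which is the mirror image of the block convention used above.

import numpy as np

rng = np.random.default_rng(9)
n, d, k = 34, 10, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated synthetic data

Xc = X - X.mean(axis=0)                  # center: the optimal subspace contains the centroid
U, D, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt[:k].T                   # coordinates of the projected observations in R^k
explained = D[:k] ** 2 / np.sum(D ** 2)  # fraction of variance carried by each kept direction
print(scores.shape, explained)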
Experimental result

Dataset: 1988 Olympics decathlon (10 variables, 34 observations)
(the variables corresponding to runtimes have been negated, so that performance increases with the value of each variable)

[Figure: spectrum of the diagonal matrix D of the SVD (singular values ordered by decreasing value), and the embedding into Rk with k = 2, where the athletes roughly split into runners and throwers.]

(the spectrum suggests that most of the variance should be explained by the first 2 intrinsic variables)
What you should know
• Vector quantization