
INF442

Algorithms for Data Analysis in C++

Steve Oudot

[email protected]

Data Science slides booklet


Introduction to Data Science
The big data era

Key figures:

▶ size of the 'global data sphere' (all data created, captured, or replicated; includes ~10% of unique data):
  2 ZB (2010) → 79 ZB (2021) → 181 ZB (2025, predicted)   (1 ZB = 10^21 bytes)
  — source: International Data Corporation

▶ correlated with the World's storage capacity
  — data centers and cloud (45%–55% in 2025)

▶ exponential growth (+30% each year on average)
  — expected to be sustained in the long run

▶ only a small fraction of the data is processed/analyzed
  — shortage of trained data scientists

Data production
Data are produced at an unprecedented rate by:
▶ Industry / Economy
▶ Sciences

▶ End users
Challenges

Complex data (non-linear, sparse, high-dimensional)

Corrupted data (noise, outliers, missing values)

Big data (streamed, online, distributed)

Data science’s celebrated successes...

AI for games:

1997: IBM’s Deep Blue wins chess match


against world champion G. Kasparov

2016: DeepMind’s AlphaGo wins Go match


against 18-time world champion Lee Sedol

2019: DeepMind’s AlphaStar beats


Starcraft II professional players
Data science’s celebrated successes...

ImageNet Challenge:

▶ database of 40 · 10^6 + images, structured into 20 · 10^3 + categories

▶ images collected on the Internet

▶ annotation process crowdsourced


to Amazon Mechanical Turk

[J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei: ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009]

Data science’s celebrated successes...

ImageNet Challenge:

▶ annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

▶ until 2011, classification error rates around 25%

▶ 2012: breakthrough → deep CNN (AlexNet) reduced the error to 16%
  (this network has more than 60 million parameters to tune)

▶ by now: error rates typically below 5%, better-than-human performance on narrow tasks
  (narrow tasks: one-against-all, e.g. recognizing cats, cars, etc.)

[Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012]
Data science’s celebrated successes...

ImageNet Challenge:

▶ annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

▶ until 2011, classification error rates around 25%

▶ 2012: breakthrough → deep CNN (AlexNet) reduced the error to 16%
  (this network has more than 60 million parameters to tune)

▶ unsupervised pre-training ≃ using auto-encoders as feature generators to be plugged into supervised learning
  (unsupervised pre-training leads to concept learning, e.g. human face, cat face)

  (here the auto-encoder has 9 layers)

[Le et al.: Building high-level features using large scale unsupervised learning, ICML 2012]

Data science’s celebrated successes...

ChatGPT (Generative Pre-trained Transformer):

▶ chatbot released by OpenAI

▶ based on a Transformer neural network architecture

▶ pre-trained, then refined by Reinforcement Learning from Human Feedback (RLHF)

▶ able to generate realistic texts of various types: essays, poetry, recipes, code, etc.

▶ able to produce accurate translations, summaries, sentiment analysis, etc.
... and notorious failures

Microsoft’s Tay:

▶ AI-powered chat bot, launched on Twitter (@TayandYou) on March 23, 2016

▶ learned from its interactions with people

▶ shut down only 16 hours after launch

▶ produced inflammatory, offensive (racist, sexually-charged) tweets

▶ training overrun by trolls

▶ numerous questions raised


(technical, legal, ethical)

... and notorious failures

Other notorious recent AI failures:

Amazon's AI recruiting tool proven to be gender-biased

Uber's self-driving car kills a pedestrian in Arizona
What is data science?
Aim: develop tools to store, manipulate, and analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips-papers)

What is data science?

Core topics:

▶ statistical analysis

▶ machine learning / deep learning

▶ pattern recognition

▶ data mining

▶ optimization (convex / combinatorial)

▶ database management and distributed systems

▶ high-performance computing (streaming, distributed, cloud)


Data?
Datum ≡ observation ≡ ”chunk of information”

Data?
Datum ≡ observation ≡ ”chunk of information”

Vector representation
(note: here we are talking about the data representations taken as input by most processing algorithms)

coordinate matrix: rows = observations x_1 , · · · , x_n ; columns = variables v_1 , · · · , v_d

categorical variables: 1, 2, · · · , K (arbitrary labels)


(e.g. ”cat”, ”dog”, ”horse”)

continuous variables: real or complex values


(e.g. temperature, pressure, geographic coordinate, income, amplitude/phase)
Data?
Datum ≡ observation ≡ ”chunk of information”

Metric representation

distance / (dis-)similarity matrix: rows and columns = observations x_1 , · · · , x_n

distances: Euclidean, Hamming, geodesic, diffusion, edit, Jaccard, Wasserstein, etc.

(dis-)similarity measures: cosine, Kullback-Leibler, Bregman divergences, etc.

Data?
Datum ≡ observation ≡ ”chunk of information”

Vector representation / Metric representation

feature extraction: map raw observations to feature vectors in R^d
Programming languages for data science

▶ Databases / data manipulation: Structured Query Language (SQL)
  (note: other modern query languages are built on SQL, e.g. QBE is in fact just a front-end)

▶ Data analysis: Python (CS) / R (stats)
  (note: what is taught is (1) the principles of each approach; (2) how to apply it in Python)

▶ Effective data processing: C / C++ / CUDA (GPGPU)

[...]

Learning paradigms

Supervised learning
Input: data with labels (examples)
Goal: predict the labels of new data

Typical problems:
▶ classification (categorical labels)
▶ regression (continuous labels)
▶ forecasting (regression on time series, e.g. energy consumption from weather parameters)
Learning paradigms

Unsupervised learning
Input: data without labels

Goal: identify patterns, correlations

Typical problems:
▶ clustering
▶ dimensionality reduction
▶ anomaly detection / noise removal

Learning paradigms

Unsupervised learning

Semi-supervised learning (only a fraction of the input data has labels)

Supervised learning
Learning paradigms

Reinforcement learning
Input: Markov decision process:
▶ agent & environment states, vis. rules, actions, transition probabilities, rewards

Goal: find a policy that minimizes the regret
(note: it is the total loss of reward throughout the process that is measured, i.e. the expected loss compared to the optimal strategy, hence every mistake is penalized)

Typical problems:

▶ exploration vs. exploitation (e.g. multi-armed bandit)
  (typically, problems where the exploration vs. exploitation dilemma appears can be modelled as reinforcement learning)

▶ control

[Géron 2017]
Learning paradigms

(source: NVIDIA)
Nearest-Neighbors Search

Outline:

• Problem statement

• Naive approach: linear scan

• Challenges and popular approaches

• k-d trees:

- definition

- construction

- usage for NN search: defeatist search vs. backtracking search

- benchmarks and curse of dimensionality


Outline

• Problem statement

• Naive approach: linear scan

• Challenges and popular approaches

• k-d trees:

- definition

- construction

- usage for NN search: defeatist search vs. backtracking search

- benchmarks and curse of dimensionality

Nearest neighbor search

pre-processing input: P

query input: q

goal: find p ∈ NN_P (q), i.e. d(q, p) = min_{p′∈P} d(q, p′) = d(q, P)
Nearest neighbor search

Variants:

• k-nearest neighbors: find the k points closest to q in P

• r-nearest neighbor: find a point p ∈ P such that d(q, p) ≤ r

• metrics:
▶ ℓ2 , ℓp , ℓ∞

▶ strings: Hamming distance

▶ images: optimal transport distances

▶ point clouds: (Gromov-)Hausdorff distances

▶ proteins: RMSD distances

▶ ···

A fundamental problem across data sciences

• clustering, e.g. k-means, mean-shift

• information retrieval in databases

• information theory, e.g. vector quantization

• supervised learning, e.g. NN-classifiers

• ···

Outline

• Problem statement

• Naive approach: linear scan

• Challenges and popular approaches

• k-d trees:

- definition

- construction

- usage for NN search: defeatist search vs. backtracking search

- benchmarks and curse of dimensionality

Linear scan

Input: P = {p1 , · · · , pn } ⊂ Rd , q ∈ Rd

dmin := ∞ (dist. to nearest neighbor among the pts viewed so far)

for i = 1 to n do:

dmin := min {dmin , d(q, pi )}

return dmin (or the index i that achieves dmin)

(note: in the following, we will usually record only dmin, not i, to simplify the pseudo-code; storing the index along the way is straightforward)

Complexity:

space: O(d n) — n points, d coordinates each

time: O(d n) — n iterations, 1 distance computation each


Outline

• Problem statement

• Naive approach: linear scan

• Challenges and popular approaches

• k-d trees:

- definition

- construction

- usage for NN search: defeatist search vs. backtracking search

- benchmarks and curse of dimensionality

Strategy and challenges

Strategy:
▶ preprocess the n points of P in Rd into some data structure DS
for fast answers to nearest-neighbor queries

Ideal wish list:


▶ DS should have linear size in n and polynomial size in d
▶ a query should take sublinear time in n and polynomial time in d
e.g. binary search trees in d = 1: linear size, O(log n) time

Core difficulties:
▶ curse of dimensionality: hard to outperform the linear scan in high d
▶ concentration of distances in high dimensions [Demartinez '94]
▶ interpretation: meaningfulness of distances in high d (concentration)
Popular approaches
(figure: quadtree, k-d tree, RP-tree subdivisions)

• Linear scan (the baseline): O(dn) space and time

• Voronoi diagrams

• Tree-like data structures


▶ quadtrees (split at midpoint in all coordinates)

▶ tries / dyadic trees (split at mean, cycle around coordinates)


▶ kd-trees (split at median, cycle around coordinates)
  (these tree-like structures implement binary space partitions)

▶ Random Projection trees (split at median along random coordinates)

▶ PCA trees (split at median along 1st eigenvector of covariance matrix)

▶ ···

• Locality Sensitive Hashing


Outline

• Problem statement

• Naive approach: linear scan

• Challenges and popular approaches

• k-d trees:

- definition

- construction

- usage for NN search: defeatist search vs. backtracking search

- benchmarks and curse of dimensionality

kd-tree
(origin of the name: see the original 1977 paper [Friedman, Bentley, Finkel: An algorithm for finding best matches in logarithmic expected time])

• a binary tree encoding a hierarchy of binary space partitions

• each internal node implements a binary spatial partition induced by a hyperplane H, dividing the point cloud into three subsets:
▶ node's data: a distinguished point lying on H
▶ right subtree: all points lying on one side of H
▶ left subtree: all other points

• subdivision stops whenever at most n0 points remain, so each leaf represents a trivial partition (i.e. a single subset) with ≤ n0 points
⇝ size: O(dn)

kd-tree specifics (several variants of the construction exist; this is the historical one):
▶ H is orthogonal to a coordinate axis (possible choices: cyclic iteration over the coordinates, or at each step the coordinate of max spread)
▶ H goes through the median in the considered coordinate; in particular, the median rule implies that the kd-tree is balanced

(n0 = 1)

Example

(figure: a planar point set p1 , · · · , p11 with its kd-tree subdivision by lines l1 , · · · , l10 , and the corresponding binary tree)

li : data at internal node
pi : data at leaf node
(note: left-right labels are arbitrary)
Outline

• Problem statement

• Naive approach: linear scan

• Challenges and popular approaches

• k-d trees:

- definition

- construction

- usage for NN search: defeatist search vs. backtracking search

- benchmarks and curse of dimensionality

Recursive construction

2 types of nodes:

- leaf: contains a batch of ≤ n0 points


- internal: contains a coordinate axis and value for splitting,
plus a point and refs. to 2 children

build (P , c): (P = current point cloud, c = coordinate selected for splitting)

if #P ≤ n0 then create a leaf storing P

else:
compute median m of {p[c] : p ∈ P } and p∗ ∈ P such that p∗ [c] = m

compute Pl = {p ∈ P | p[c] ≤ m} \ {p∗ } and Pr = {p ∈ P | p[c] > m}

create an internal node storing c, m, p∗ , and refs to the nodes given by
  build (Pl , (c + 1)%d) and build (Pr , (c + 1)%d)

(cyclic iteration; dimensions are indexed from 0 to d − 1 included)
Recursive construction

Complexity analysis: 1 call to build on each node of the tree
⇒ O(n) calls (1 point charged per node, because each partition splits the point cloud)
⇒ O(median(#P) + #P) work per call (here P is the subcloud of points considered at that node)

Recursive construction

Median computation:
• by sorting the points of the current cloud P

▶ median(#P ) is then in O(#P log #P ) = O(#P log n)

▶ total complexity: C(n) = n log n + 2 C(n/2) ⇝ O(n log2 n)

• by pre-sorting all the points along each coordinate
  (for this you need to encode P through d sorted lists/arrays, duplicating the relevant halves at each split)

  ▶ median(#P) is then in O(#P) after O(d n log n) preprocessing

  ▶ total complexity: C(n) = d n + 2 C(n/2) ⇝ O(d n log n)

• by linear median (randomized or deterministic, cf. INF562)
  (the randomized linear median is based on the same idea as QuickSort, with a randomized choice of pivot)

  ▶ median(#P) is then in O(#P) with no preprocessing

  ▶ total complexity: C(n) = n + 2 C(n/2) ⇝ O(n log n)
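As an illustration, a minimal C++ sketch of the recursive construction (not the course's reference code; the Point/KdNode types and n0 handling are assumptions). It uses std::nth_element for the median step, which corresponds to the expected-linear-time randomized median option above.

    #include <algorithm>
    #include <memory>
    #include <vector>

    using Point = std::vector<double>;

    struct KdNode {
        // leaf: non-null batch, children null; internal: split coord c, value m, point p
        std::vector<Point> batch;
        int c = -1;                 // c < 0 marks a leaf
        double m = 0.0;
        Point p;
        std::unique_ptr<KdNode> left, right;
    };

    // build(P, c): split at the median of coordinate c, cycle coordinates (n0 >= 1 assumed).
    std::unique_ptr<KdNode> build(std::vector<Point> P, int c, std::size_t n0, int d) {
        auto node = std::make_unique<KdNode>();
        if (P.size() <= n0) { node->batch = std::move(P); return node; }

        std::size_t mid = P.size() / 2;
        // Place the median of coordinate c at position mid, in expected O(#P) time.
        std::nth_element(P.begin(), P.begin() + mid, P.end(),
                         [c](const Point& a, const Point& b) { return a[c] < b[c]; });
        node->c = c;
        node->m = P[mid][c];
        node->p = P[mid];

        std::vector<Point> Pl(P.begin(), P.begin() + mid);        // points with p[c] <= m (ties may land on either side)
        std::vector<Point> Pr(P.begin() + mid + 1, P.end());      // points with p[c] >= m, without p*
        node->left  = build(std::move(Pl), (c + 1) % d, n0, d);
        node->right = build(std::move(Pr), (c + 1) % d, n0, d);
        return node;
    }

With the median rule, each recursive call removes the splitting point and halves the rest, which is what makes the resulting tree balanced.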


Outline

• Problem statement

• Naive approach: linear scan

• Challenges and popular approaches

• k-d trees:

- definition

- construction

- usage for NN search: defeatist search vs. backtracking search

- benchmarks and curse of dimensionality

Usage for NN search

Strategy 1: defeatist search

dmin := ∞ (min dist. to pts viewed so far)

search (node): (node = root initially)

if node = leaf :
  dmin := min{ dmin , min_{p∈node.batch} d(q, p) }

else:
  dmin := min{ dmin , d(q, node.point) }

  if q on "left" side of node.H
    search (node.left)
  else (q on "right" side of node.H)
    search (node.right)

Query time: O(d (n0 + log(n/n0)))
(for a complete balanced tree: roughly n/n0 leaves, so height ≈ log2(n/n0))

May fail! (the true nearest neighbor may lie on the other side of a splitting hyperplane and never be visited, e.g. for the query point q′ in the figure)

(n0 = 1)
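A possible C++ rendering of the defeatist search (an illustrative sketch reusing the Point/KdNode types and the sq_dist helper from the earlier sketches; not the course's code).

    #include <algorithm>
    #include <cmath>

    // Defeatist search: descend to the single leaf whose cell contains q.
    // Fast (one root-to-leaf path), but may miss the true nearest neighbor.
    void defeatist(const KdNode* node, const Point& q, double& dmin) {
        if (!node) return;
        if (node->c < 0) {                                   // leaf: scan its batch
            for (const Point& p : node->batch)
                dmin = std::min(dmin, std::sqrt(sq_dist(p.data(), q.data(), q.size())));
            return;
        }
        dmin = std::min(dmin, std::sqrt(sq_dist(node->p.data(), q.data(), q.size())));
        if (q[node->c] <= node->m) defeatist(node->left.get(),  q, dmin);   // q on "left" side of H
        else                       defeatist(node->right.get(), q, dmin);  // q on "right" side of H
    }

Typical usage: initialize dmin to infinity, call defeatist(root, q, dmin), and read off dmin (an upper bound on d(q, P) that may be strict).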

Example

(figure: defeatist search for a query q on the kd-tree of the previous example; only the cells on q's side of each splitting line are visited)

li : data at internal node
pi : data at leaf node
(note: left-right labels are arbitrary)
Usage for NN search

Strategy 2: backtracking search

dmin := ∞ (min dist. to pts viewed so far)

search (node): (node = root initially)

if node = leaf :
  dmin := min{ dmin , min_{p∈node.batch} d(q, p) }

else:
  dmin := min{ dmin , d(q, node.point) }

  if B(q, dmin ) intersects "left" side of node.H
    search (node.left)

  if B(q, dmin ) intersects "right" side of node.H
    search (node.right)

Always succeeds: dmin ≥ d(q, NN(q)) ⇒ B(q, dmin ) intersects all cells containing NN(q) in the subdivision, throughout the search

Query time may be up to linear (bad case: query point at the center, all cells visited)

(n0 = 1)
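A matching C++ sketch of the backtracking search (again reusing the assumed KdNode/Point/sq_dist helpers from the construction sketch; illustrative only).

    #include <algorithm>
    #include <cmath>

    // Backtracking search: visit every cell that the ball B(q, dmin) can intersect.
    // Always returns the exact nearest-neighbor distance; worst-case linear time.
    void backtracking(const KdNode* node, const Point& q, double& dmin) {
        if (!node) return;
        if (node->c < 0) {                                   // leaf: scan its batch
            for (const Point& p : node->batch)
                dmin = std::min(dmin, std::sqrt(sq_dist(p.data(), q.data(), q.size())));
            return;
        }
        dmin = std::min(dmin, std::sqrt(sq_dist(node->p.data(), q.data(), q.size())));
        double gap = q[node->c] - node->m;                   // signed distance from q to the hyperplane H
        const KdNode* near = (gap <= 0) ? node->left.get()  : node->right.get();
        const KdNode* far  = (gap <= 0) ? node->right.get() : node->left.get();
        backtracking(near, q, dmin);                         // q's own side first: B(q, dmin) always intersects it
        if (std::fabs(gap) <= dmin)                          // ball crosses H => the far side may contain the NN
            backtracking(far, q, dmin);
    }

Visiting q's own side first shrinks dmin early, which is what usually prunes most of the far-side recursions in practice.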

Example

(figure: backtracking search for a query q on the same kd-tree, shown step by step; the cells intersected by the ball B(q, dmin ) are explored, and dmin shrinks as closer points are found)

li : data at internal node
pi : data at leaf node
(note: left-right labels are arbitrary)

Example

best-case input (uniform distribution): most cells are fat

query time = O(cd log n), where the constant cd is exponential in d (cd ≈ 2^d)

[Friedman et al.: An algorithm for finding best matches in logarithmic expected time, 1977]
Example

worst-case input (non-uniform distribution): many skinny cells
(generally speaking, the NN problem becomes difficult when there are many approximate nearest neighbors)

query time = Ω(d n)

Variants: priority search, early backtracking, random cutting hyperplanes, etc.


Outline

• Problem statement

• Naive approach: linear scan

• Challenges and popular approaches

• k-d trees:

- definition

- construction

- usage for NN search: defeatist search vs. backtracking search

- benchmarks and curse of dimensionality

Benchmarks

(plot: average query time (µs) vs. # data points, uniform measure in the unit square in 2d)
(note: the lack of linearity of the linear scan is probably due to the asymptotic regime starting pretty late)

Benchmarks

(plot: average query time (µs) vs. # data points, uniform measure on the unit circle in 2d; the query point stands at the origin in this experiment)
(note: beware that the Y-scale has changed)

High dimensions

pre-processing input: P ⊂ Rd

query input: q

goal: find p ∈ NNP (q)

Curse of Dimensionality:
Every data structure for NN-search has
either exponential size or exponential
query time (in d) in the worst case.

→ holds both in theory and in practice [Weber et al. ’98] [Arya et al. ’98]

→ underlying phenomenon: concentration of measure


(distances concentrate around mean) [Demartinez ’94]

Benchmarks

(plot: average query time (µs) vs. dimension, 10,000 pts sampled uniformly inside the unit cube)
(note: the point of inversion of the linear-scan and backtracking plots lies around dim = 12, with some caveats about the implementation)

→ in high dimensions, one solves the approximate NN problem instead


What you should know

• Context and definition of NN search problem

• Linear scan

• An idea of existing approaches

• k-d trees: definition, construction, defeatist search, backtracking search

• Curse of dimensionality
Clustering with k-Means

Outline:

• Clustering: problem statement and popular approaches

• k-means: minimizing intra-cluster variance

• Characterizing the argmin

• Computation: easy special cases and hard general case

• Lloyd’s variational heuristic

• Initialization

• Choosing the number of clusters


Outline

• Clustering: problem statement and popular approaches

• k-means: minimizing intra-cluster variance

• Characterizing the argmin

• Computation: easy special cases and hard general case

• Lloyd’s variational heuristic

• Initialization

• Choosing the number of clusters

Clustering (a.k.a. unsupervised classification)


Input: a finite set of observations: - point cloud with coordinates
- distance / (dis-)similarity matrix

(note: the problem is ill-posed: no ground truth is provided, i.e. no labels on the data at the start)

Task:
partition the data points into homogeneous subsets (clusters)
A wealth of approaches

Variational: k-means, k-medoids, EM

Density thresholding: DBSCAN, OPTICS

Mode seeking (graph-based hill climbing): Mean/Medoid/Quick Shift

Spectral: Normalized Cut, Multiway Cut

Valley seeking: [JBD'79], NDDs [ZZZL'07]

Hierarchical divisive/agglomerative: single-linkage, BIRCH
Outline

• Clustering: problem statement and popular approaches

• k-means: minimizing intra-cluster variance

• Characterizing the argmin

• Computation: easy special cases and hard general case

• Lloyd’s variational heuristic

• Initialization

• Choosing the number of clusters

k-Means
Paradigm: cast clustering into an optimization problem

→ minimize total intra-cluster variance

input: P ⊂ Rd finite (#P = n)

hyper-parameter: k: number of clusters

parameters: c1 , · · · , ck ∈ Rd : cluster centers

            σ : P → {1, · · · , k}: partition
            (note: in fact this is a labeled partition)

objective:  min_{c1 ,··· ,ck ,σ} (1/n) Σ_{p∈P} ∥p − c_σ(p) ∥₂²

(visualization tool: http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html)


k-Means
Paradigm: cast clustering into an optimization problem

→ minimize total intra-cluster variance

input: P ⊂ Rd finite (#P = n)

hyper-parameter: k: number of clusters

parameters: c1 , · · · , ck ∈ Rd : cluster centers
            σ : P → {1, · · · , k}: partition

notation: Ci := σ −1 ({i}), ni := #Ci

objective:  min_{c1 ,··· ,ck ,σ} (1/n) Σ_{p∈P} ∥p − c_σ(p) ∥₂²

decomposition of the objective:

  (1/n) Σ_{p∈P} ∥p − c_σ(p) ∥₂²  =  (1/n) Σ_{i=1}^{k} Σ_{p∈Ci} ∥p − ci ∥₂²  =  Σ_{i=1}^{k} (ni /n) Var(Ci , ci )

  (a weighted sum of cluster variances — caution: each 'variance' is taken with the particular choice of center ci )

  this whole quantity is denoted Var(P, c1 , · · · , ck , σ)
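A small C++ sketch of this objective (illustrative; the flat row-major storage and function name are assumptions, not the course's code).

    #include <cstddef>
    #include <vector>

    // Total intra-cluster variance (1/n) * sum_p ||p - c_{sigma(p)}||^2.
    // P: n points of dimension d (row-major); centers: k centers (row-major); sigma: labels in [0, k).
    double kmeans_objective(const std::vector<double>& P, std::size_t d,
                            const std::vector<double>& centers,
                            const std::vector<std::size_t>& sigma) {
        std::size_t n = P.size() / d;
        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            const double* p = &P[i * d];
            const double* c = &centers[sigma[i] * d];
            for (std::size_t t = 0; t < d; ++t) {
                double diff = p[t] - c[t];
                total += diff * diff;
            }
        }
        return total / static_cast<double>(n);
    }

This is the quantity that both Lloyd's heuristic (later in this chapter) and the elbow method monitor.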
Outline

• Clustering: problem statement and popular approaches

• k-means: minimizing intra-cluster variance

• Characterizing the argmin

• Computation: easy special cases and hard general case

• Lloyd’s variational heuristic

• Initialization

• Choosing the number of clusters

Characterizing the argmin


Fixed partition: σ : P → {1, · · · , k}

→ optimize each center ci independently, yields center of mass (centroid)

Prop: argmin_{ci} Var(Ci , ci ) = (1/ni ) Σ_{p∈Ci} p
(this argmin is both the Fréchet mean and the arithmetic mean, i.e. the centroid)
Characterizing the argmin
Fixed partition: σ : P → {1, · · · , k}

→ optimize each center ci independently, yields the center of mass (centroid)

Prop: argmin_{ci} Var(Ci , ci ) = (1/ni ) Σ_{p∈Ci} p =: c∗

proof:
(this proof is simpler and more direct than the one in the lecture notes)

Var(Ci , ci ) = (1/ni ) Σ_{p∈Ci} ∥p − ci ∥₂² = (1/ni ) Σ_{p∈Ci} ∥(p − c∗ ) + (c∗ − ci )∥₂²

             = (1/ni ) Σ_{p∈Ci} ∥p − c∗ ∥₂² + (2/ni ) Σ_{p∈Ci} (p − c∗ ) · (c∗ − ci ) + (1/ni ) Σ_{p∈Ci} ∥c∗ − ci ∥₂²

             = Var(Ci , c∗ ) + ∥c∗ − ci ∥₂²        (the middle term vanishes since Σ_{p∈Ci} (p − c∗ ) = 0)

and ∥c∗ − ci ∥₂² ≥ 0, so the minimum is attained at ci = c∗   □

Characterizing the argmin


Fixed centers: c1 , · · · , ck

→ optimize each assignment σ(p) independently, yields Voronoi partition


Prop: argmin Var(P, c1 , · · · , ck , σ) = p 7→ NN{c1 ,··· ,ck } (p) =: σNN
σ

proof:

For each point p ∈ P :


∥p − cσ(p) ∥22 ≥ ∥p − NN{c1 ,··· ,ck } (p)∥22
Hence, by termwise minimization:
X X
∥p − cσ(p) ∥22 ≥ ∥p − NN{c1 ,··· ,ck } (p)∥22
p∈P p∈P


Characterizing the argmin
Fixed centers: c1 , · · · , ck

→ optimize each assignment σ(p) independently, yields Voronoi partition


Prop: argmin Var(P, c1 , · · · , ck , σ) = p 7→ NN{c1 ,··· ,ck } (p) =: σNN
σ

V (ci ) := {x ∈ Rd | ∥x − ci ∥2 ≤ ∥x − cj ∥2 ∀j}

Prop: V (ci ) ∩ V (cj ) is affine


proof:

∥x − ci ∥22 = ∥x − cj ∥22
⇐⇒ x2 − 2 x · ci + c2i = x2 − 2 x · cj + c2j
⇐⇒ 2 x · (cj − ci ) + (c2i − c2j ) = 0
(affine equation) □

Characterizing the argmin


Fixed centers: c1 , · · · , ck

→ optimize each assignment σ(p) independently, yields Voronoi partition


Prop: argmin Var(P, c1 , · · · , ck , σ) = p 7→ NN{c1 ,··· ,ck } (p) =: σNN
σ

V (ci ) := {x ∈ Rd | ∥x − ci ∥2 ≤ ∥x − cj ∥2 ∀j}

Prop: V (ci ) ∩ V (cj ) is affine


V (ci ) = ⋂_{j≠i} { x ∈ Rd | ∥x − ci ∥₂² ≤ ∥x − cj ∥₂² }   (intersection of half-spaces)
(note: ∥ · − cr ∥₂ denotes the function 'distance to cr ')

⇒ clusters are convex


Characterizing the argmin
This characterization is not unique:

total variance ≈ 4,500 total variance ≈ 8,500

vs.

Both configurations are centroidal Voronoi partitions

Characterizing the argmin


This characterization is not unique:

Prop: Every centroidal Voronoi partition (P, c∗1 , · · · , c∗k , σNN ) such that there
are no points of P on the boundaries corresponds to a local minimum
of Var(P, c1 , · · · , ck , σ).

proof:

• no pts on boundaries ⇒ Var(P, c∗1 , · · · , c∗k , σ) > Var(P, c∗1 , · · · , c∗k , σNN ) ∀σ ̸= σNN
Characterizing the argmin
This characterization is not unique:

Prop: Every centroidal Voronoi partition (P, c∗1 , · · · , c∗k , σNN ) such that there
are no points of P on the boundaries corresponds to a local minimum
of Var(P, c1 , · · · , ck , σ).

proof:

• no pts on boundaries ⇒ Var(P, c∗1 , · · · , c∗k , σ) > Var(P, c∗1 , · · · , c∗k , σNN ) ∀σ ̸= σNN

• for σ ̸= σNN fixed, the map (c1 , · · · , ck ) 7→ Var(P, c1 , · · · , ck , σ) is continuous

⇒ for (c1 , · · · , ck ) close enough to (c∗1 , · · · , c∗k ),


Var(P, c1 , · · · , ck , σ) > Var(P, c∗1 , · · · , c∗k , σNN )

• ∃ finitely many partitions ⇒ ∃ common neighborhood for all σ ̸= σNN

• meanwhile, (c∗1 , · · · , c∗k ) are optimal for σNN itself


Characterizing the argmin


This characterization is not unique:

Prop: Every centroidal Voronoi partition (P, c∗1 , · · · , c∗k , σNN ) such that there
are no points of P on the boundaries corresponds to a local minimum
of Var(P, c1 , · · · , ck , σ).

→ degenerate cases (points on boundaries) lead to non-minimal configurations:

the data point on the bisector can be assigned indifferently to the blue cluster or to the red one
Outline

• Clustering: problem statement and popular approaches

• k-means: minimizing intra-cluster variance

• Characterizing the argmin

• Computation: easy special cases and hard general case

• Lloyd’s variational heuristic

• Initialization

• Choosing the number of clusters

Computation: easy special cases


Case k = 1: P = {p1 , · · · , pn } ⊂ Rd

→ put c1 at the center of mass (arithmetic mean) of P


complexity: O(n d)
Computation: easy special cases

Case d = 1: P = {p1 , · · · , pn } ⊂ R

→ solve the problem in polynomial time by dynamic programming:

• sort and relabel the points of P so that p1 ≤ p2 ≤ · · · ≤ pn

  (now look at an optimal cluster configuration: Voronoi cells are convex ⇒ optimal clusters are contiguous: C1 < C2 < · · · < Ck )

  Given pj := min Ck , the clusters C1 , · · · , Ck−1 are optimal for {p1 , · · · , pj−1 }

• recurrence: OPT(n, k) := min_{c1 ,··· ,ck ,σ} Σ_{i=1}^{n} |pi − c_σ(pi ) |²

  OPT(n, k) = min_{1≤j≤n} { OPT(j − 1, k − 1) + (n + 1 − j) · Var({pj , · · · , pn }) }
  (the second term is the sum of squared distances of pj , · · · , pn to their mean)

  base cases: OPT(0, k) = 0, OPT(n, 0) = +∞ for n > 0

  (picture: p1 p2 · · · pj · · · pn on the real line R, grouped into contiguous clusters C1 , C2 , · · · , Ck−1 , Ck )

complexity: O(n³ k) naive, or O(n² k) with linear-time aggregation of variances
(this requires precomputing the partial sums Σ_{i=j}^{n} pi (and of squares) for all j = 1, · · · , n, taking O(n²) time)
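A C++ sketch of this dynamic program (illustrative; function name and prefix-sum trick are assumptions). It returns the optimal sum of squared distances, i.e. OPT(n, k), in O(n² k) time after sorting.

    #include <algorithm>
    #include <limits>
    #include <vector>

    // Exact 1-d k-means by dynamic programming over contiguous clusters.
    double kmeans_1d(std::vector<double> p, int k) {
        std::sort(p.begin(), p.end());
        int n = static_cast<int>(p.size());
        const double INF = std::numeric_limits<double>::infinity();

        // Prefix sums of values and of squares: S1[i] = p[0]+...+p[i-1], S2 likewise.
        std::vector<double> S1(n + 1, 0.0), S2(n + 1, 0.0);
        for (int i = 0; i < n; ++i) {
            S1[i + 1] = S1[i] + p[i];
            S2[i + 1] = S2[i] + p[i] * p[i];
        }
        // cost(j, i) = sum of squared distances of the block p[j-1..i-1] (1-indexed) to its mean.
        auto cost = [&](int j, int i) {
            double len = i - j + 1;
            double s1 = S1[i] - S1[j - 1], s2 = S2[i] - S2[j - 1];
            return s2 - s1 * s1 / len;
        };

        // OPT[t][i] = optimal cost of clustering p[0..i-1] into t contiguous clusters.
        std::vector<std::vector<double>> OPT(k + 1, std::vector<double>(n + 1, INF));
        OPT[0][0] = 0.0;
        for (int t = 1; t <= k; ++t)
            for (int i = 1; i <= n; ++i)
                for (int j = t; j <= i; ++j)          // last cluster is {p[j-1], ..., p[i-1]}
                    OPT[t][i] = std::min(OPT[t][i], OPT[t - 1][j - 1] + cost(j, i));
        return OPT[k][n];
    }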
Computation: hard general case
For arbitrary k > 1 and d > 1:

→ naive: try all k^n partitions σ : P → {1, · · · , k}

  complexity: O(k^n poly(n, k, d))

→ better: try only the O(n^{kd}) Voronoi partitions σNN  [Inaba, Katoh, Imai 1994]
  (there are O(n^{kd}) such partitions, but the challenge is to find them, for which an arrangement of quadrics is used)

  complexity: O(n^{kd+1} poly(n, k, d))

Problem is NP-hard:
- for arbitrary d, even when k = 2  [Aloise et al. 2009]
  (note: earlier proofs of NP-hardness existed, but they all turned out to be flawed)
- for arbitrary k, even when d = 2  [Mahajan et al. 2009]

⇒ no poly-time algorithm in (n, k, d)


(unless P = N P )
Outline

• Clustering: problem statement and popular approaches

• k-means: minimizing intra-cluster variance

• Characterizing the argmin

• Computation: easy special cases and hard general case

• Lloyd’s variational heuristic

• Initialization

• Choosing the number of clusters

Lloyd's variational heuristic

(transition: since exact algorithms are so costly, people resort to heuristics in practice; Lloyd's is the most widely used)

Input: P = {p1 , · · · , pn } ⊂ Rd , k ≥ 1

Initialize cluster centers c1 , · · · , ck ∈ Rd

Repeat:

• compute/update the Voronoi partition σNN : P → {1, · · · , k} and let σ = σNN    (E-step)

• move each center ci to the centroid of its cluster Ci = σ −1 ({i})    (M-step)

Until convergence (Var(P, c1 , · · · , ck , σ) stabilizes)

(illustration source: Wikipedia)

→ k-means is a special instance of the EM algorithm
(the clustering model is parametrized by the cluster centers c1 , · · · , ck ; the E-step computes the posterior assignments)
EM is more general: non-uniform weights, anisotropic gaussians, soft clustering...
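A possible C++ implementation of Lloyd's heuristic (a sketch under assumptions: flat row-major storage, centers already initialized e.g. by Forgy or k-means++, empty clusters keep their old center; not the course's reference code).

    #include <cstddef>
    #include <limits>
    #include <vector>

    // Runs Lloyd iterations until the total intra-cluster variance stabilizes.
    // Returns the final total variance; sigma receives the final labels.
    double lloyd(const std::vector<double>& P, std::size_t d, std::size_t k,
                 std::vector<double>& centers,        // in/out: k*d values
                 std::vector<std::size_t>& sigma,     // out: n labels
                 double tol = 1e-9, std::size_t max_iter = 100) {
        std::size_t n = P.size() / d;
        sigma.assign(n, 0);
        double prev = std::numeric_limits<double>::infinity();

        for (std::size_t it = 0; it < max_iter; ++it) {
            // E-step: assign every point to its nearest center (Voronoi partition).
            double total = 0.0;
            for (std::size_t i = 0; i < n; ++i) {
                double best = std::numeric_limits<double>::infinity();
                for (std::size_t c = 0; c < k; ++c) {
                    double dist = 0.0;
                    for (std::size_t t = 0; t < d; ++t) {
                        double diff = P[i * d + t] - centers[c * d + t];
                        dist += diff * diff;
                    }
                    if (dist < best) { best = dist; sigma[i] = c; }
                }
                total += best;
            }
            total /= static_cast<double>(n);

            // M-step: move each center to the centroid of its cluster.
            std::vector<double> sum(k * d, 0.0);
            std::vector<std::size_t> count(k, 0);
            for (std::size_t i = 0; i < n; ++i) {
                ++count[sigma[i]];
                for (std::size_t t = 0; t < d; ++t) sum[sigma[i] * d + t] += P[i * d + t];
            }
            for (std::size_t c = 0; c < k; ++c)
                if (count[c] > 0)
                    for (std::size_t t = 0; t < d; ++t)
                        centers[c * d + t] = sum[c * d + t] / count[c];

            if (prev - total < tol) return total;      // variance has stabilized
            prev = total;
        }
        return prev;
    }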

Lloyd's variational heuristic

Prop: The algorithm converges after finitely many iterations.

proof:

Each E-step decreases the total variance (Voronoi partition)

Each M-step decreases the variance (centroid)

⇒ total variance decreases strictly at each non-terminal iteration (stopping criterion)
(this alone only shows that the algorithm converges at the limit, since it decreases a non-negative energy)

After each iteration, the lowest variance is achieved for the current Voronoi partition σ

⇒ (by the strict decrease) each Voronoi partition is considered during ≤ 1 non-terminal iteration

There are finitely many different Voronoi partitions — at most O(n^{kd})

⇒ finitely many iterations in total — at most O(n^{kd}) iterations

Complexity: O(n^{kd} poly(n, k, d))
(note: this is only an upper bound; in practice the number of iterations is typically much smaller)
Lloyd’s variational heuristic


Prop: The algorithm converges after finitely many iterations.

Prop: The algorithm generically converges to a local minimum of the total variance.
(this requires having no data point on Voronoi cell boundaries throughout the iterative process)

Shortcomings:

• convergence may require an exponential number of iterations,


even in the plane (d = 2) [Vattani 2011]

• depending on the initialization, the local minimum reached can be arbitrarily bad
[MacQueen 1967]

(lots of room for improvement)

Note: certified (1 + ε)-approx. algos. typically run in time O(2k ε−d n polylog(n, k, d))
[Har-Peled, Mazumdar 2004]
Outline

• Clustering: problem statement and popular approaches

• k-means: minimizing intra-cluster variance

• Characterizing the argmin

• Computation: easy special cases and hard general case

• Lloyd’s variational heuristic

• Initialization

• Choosing the number of clusters

Initialization
Random centers:

• sampled uniformly in bounding box

• sampled uniformly among the data points [Forgy 1965]
  (note: Forgy published the same method as [Lloyd 1957], with this particular initialization)

• defined as centroids of a random partition σ : P → {1, · · · , k}
  (assign each data point randomly and independently to i ∈ J1, kK)

→ generally speaking, performances highly depend on the input data
  (Forgy's method is preferred and performs best among these approaches)

→ they degrade quickly as k or d increases

(∼ 35% success rate with uniform sampling in the bounding box on this input)


Initialization
k-means++ [Arthur, Vassilvitskii 2007]

→ soft (randomized) variant of furthest-point sampling

draw c1 uniformly at random from P

C := {c1 }

for i = 2 to k do:

draw ci at random from P according to the probability distribution:

P(p) := d(p, C)² / Σ_{q∈P} d(q, C)² ,   where d(x, C) = min_{1≤j<i} ∥x − cj ∥2

C := C ∪ {ci }

done
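A C++ sketch of this seeding step (illustrative; the storage layout and use of std::discrete_distribution are implementation choices, not prescribed by the course).

    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <random>
    #include <vector>

    // k-means++ seeding (Arthur & Vassilvitskii 2007): each new center is drawn from P
    // with probability proportional to the squared distance to the centers chosen so far.
    std::vector<double> kmeanspp_init(const std::vector<double>& P, std::size_t d,
                                      std::size_t k, std::mt19937& rng) {
        std::size_t n = P.size() / d;
        std::vector<double> centers;
        centers.reserve(k * d);

        // First center: uniform over the data points.
        std::uniform_int_distribution<std::size_t> unif(0, n - 1);
        std::size_t first = unif(rng);
        centers.insert(centers.end(), P.begin() + first * d, P.begin() + (first + 1) * d);

        std::vector<double> d2(n, std::numeric_limits<double>::infinity());
        for (std::size_t i = 1; i < k; ++i) {
            // Update d(p, C)^2 with the last center added.
            const double* c = &centers[(i - 1) * d];
            for (std::size_t j = 0; j < n; ++j) {
                double dist = 0.0;
                for (std::size_t t = 0; t < d; ++t) {
                    double diff = P[j * d + t] - c[t];
                    dist += diff * diff;
                }
                d2[j] = std::min(d2[j], dist);
            }
            // Draw the next center with probability proportional to d2[j].
            std::discrete_distribution<std::size_t> pick(d2.begin(), d2.end());
            std::size_t next = pick(rng);
            centers.insert(centers.end(), P.begin() + next * d, P.begin() + (next + 1) * d);
        }
        return centers;
    }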

Initialization
k-means++ [Arthur, Vassilvitskii 2007]

→ soft (randomized) variant of furthest-point sampling

→ easy to implement in arbitrary dimensions

→ theoretical guarantees on the resulting (initial) total variance
  (the bound is already guaranteed right after the k-means++ initialization, therefore also upon termination of Lloyd's algorithm)

Prop: In expectation, the total variance is within a factor O(log k) of the optimum:

  E [ Var(P, c1 , · · · , ck , σNN ) ] ≤ 8 (2 + ln k) · min_{c′1 ,··· ,c′k ,σ} Var(P, c′1 , · · · , c′k , σ)
Outline

• Clustering: problem statement and popular approaches

• k-means: minimizing intra-cluster variance

• Characterizing the argmin

• Computation: easy special cases and hard general case

• Lloyd’s variational heuristic

• Initialization

• Choosing the number of clusters

Choosing the number k of clusters

k = 2 (underfitting) k = 4 (overfitting)
Choosing the number k of clusters

Elbow method (heuristic): [Thorndike 1953]

plot minc1 ,··· ,ck ,σ Var(P, c1 , · · · , ck , σ) as a function of k


∃ an elbow in the plot around the optimal value of k

(plot: min_{c1 ,··· ,ck ,σ} Var(P, c1 , · · · , ck , σ) against k = 2, · · · , 10; the 'elbow' sits between the 'arm' and the 'forearm')
Choosing the number k of clusters

Silhouette: [Rousseeuw 1987]


plot the average silhouette (1/n) Σ_{p∈P} s(p) as a function of k, where:

silhouette s(p) := (b(p) − a(p)) / max{a(p), b(p)} ∈ [−1, 1]

a(p) := average distance of p to points in its cluster C

b(p) := minC ′ ̸=C average distance of p to points in cluster C ′


Choosing the number k of clusters

Silhouette: [Rousseeuw 1987]

plot the average silhouette (1/n) Σ_{p∈P} s(p) as a function of k, where:

silhouette s(p) := (b(p) − a(p)) / max{a(p), b(p)} ∈ [−1, 1]

∃ a peak in the plot around the optimal value of k

(plot: average silhouette, between −1 and 1, as a function of k = 2, · · · , 10)
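A C++ sketch of the average-silhouette score (illustrative; it assumes a full n × n distance matrix and labels, which is one possible but not the only way to compute it).

    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <vector>

    // Average silhouette of a clustering: D is the row-major n x n distance matrix,
    // sigma the labels in [0, k). A peak of this score over k suggests a good k.
    double average_silhouette(const std::vector<double>& D, std::size_t n,
                              const std::vector<std::size_t>& sigma, std::size_t k) {
        std::vector<std::size_t> size(k, 0);
        for (std::size_t i = 0; i < n; ++i) ++size[sigma[i]];

        double total = 0.0;
        for (std::size_t p = 0; p < n; ++p) {
            // Average distance from p to every cluster.
            std::vector<double> avg(k, 0.0);
            for (std::size_t q = 0; q < n; ++q) avg[sigma[q]] += D[p * n + q];
            for (std::size_t c = 0; c < k; ++c) {
                std::size_t m = (c == sigma[p]) ? size[c] - 1 : size[c];   // exclude p itself
                avg[c] = (m > 0) ? avg[c] / m : 0.0;
            }
            double a = avg[sigma[p]];                                      // a(p): cohesion
            double b = std::numeric_limits<double>::infinity();
            for (std::size_t c = 0; c < k; ++c)                            // b(p): closest other cluster
                if (c != sigma[p]) b = std::min(b, avg[c]);
            double s = (size[sigma[p]] > 1) ? (b - a) / std::max(a, b) : 0.0;
            total += s;
        }
        return total / static_cast<double>(n);
    }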
What you should know

• Context and definition of clustering

• k-means as a minimization problem (intra-cluster variance)

• argmins as centroidal Voronoi partitions

• Efficient computation in cases k = 1 or d = 1

• Lloyd’s algorithm and its properties

• Initialization techniques: random, k-means++

• Number of centers selection: elbow method, silhouette


Hierarchical Clustering

Outline:

• Principles of hierarchical clustering

• Agglomerative hierarchical clustering

• Single-linkage clustering

• Connection to ultrametrics and stability

• Hierarchical clustering and phylogenetic trees


Outline

• Principles of hierarchical clustering

• Agglomerative hierarchical clustering

• Single-linkage clustering

• Connection to ultrametrics and stability

• Hierarchical clustering and phylogenetic trees

Hierarchical clustering
Input: a finite set of n observations: - point cloud with coordinates
- distance / (dis-)similarity matrix

Task:
partition the data points into k homogeneous subsets (clusters)
Hierarchical clustering
Q: what is the number k of clusters? what if there is more than one solution?

k = 1?

k = 2?

k = 4?

k > 4?

→ solution depends on scale

Hierarchical clustering

→ multiscale hierarchical clustering
(we specify 'hierarchical', i.e. clusters only get merged, and not split, as the scale increases; dually, clusters only get split as the scale decreases)

(figure: the same point cloud clustered at increasing scales)

Hierarchical clustering

Def: (dendrogram) Given P finite,


a dendrogram on P is a map

θ : R+ → Partitions (P )

such that:

▶ θ(0) = {singletons (P )}
▶ ∃t0 : ∀t ≥ t0 , θ(t) = {P }
▶ ∀t ≤ t′ , θ(t) refines θ(t′ ):

∀C ∈ θ(t), ∃C ′ ∈ θ(t′ ) : C ⊆ C ′

scale
Building the hierarchy

agglomerative (merge clusters)   vs.   divisive (split clusters)

Building the hierarchy

Agglomerative hierarchical clustering (AHC):

- build tree from leaves to root

- start with each observation in its own cluster (leaf)

- iteratively merge clusters until only one remains (root)

(cf. Kruskal's minimum spanning tree algorithm — at this stage this is just an analogy, however it will be fleshed out later in the lecture)

Divisive hierarchical clustering (DHC):

- build tree from root to leaves

- start with all observations in the same cluster (root)

- recursively split the clusters until they become singletons (leaves)

Same here (cf. kd-tree construction)


Building the hierarchy

Combinatorial aspects:

• each merge/division operation adds/subtracts one cluster
  ⇒ n − 1 steps for each approach (with two-fold merges or splits)

• at step k:

  AHC: n − k clusters and (n−k choose 2) choices for the merge
  ⇒ average size of the search space: Θ(n²)
  (the most commonly used approach — in this course we only look at AHC, for the reasons just invoked)

  DHC: k clusters of sizes n1 , · · · , nk , and Σ_{i=1}^{k} (2^{ni −1} − 1) choices for the split
  (choice of split: a. choose the cluster to split, summing over all clusters; b. for each cluster, choose the split)
  ⇒ average size of the search space: Θ(2^n)
  (the historical approach)


Outline

• Principles of hierarchical clustering

• Agglomerative hierarchical clustering

• Single-linkage clustering

• Connection to ultrametrics and stability

• Hierarchical clustering and phylogenetic trees

Agglomerative hierarchical clustering

Meta-algorithm: input: (P, d) where P = {p1 , · · · , pn }

(clusters are indexed by the sets of indices of the data points they contain)

start with each data point pi ∈ P as its own cluster {pi } at height 0

for k = 1 to n − 1 do: // invariant: ∃ n − k clusters at end of iteration

  choose the next two clusters C, C ′ to be merged (C, C ′ ⊂ {1, · · · , n})

  merge C and C ′ by replacing them with a single cluster C ∪ C ′

  record the new height and assign it to the merge of C, C ′

done

(for this aspect, use a union-find data structure, e.g. a disjoint-set forest, as in Kruskal's algorithm — we defer the students to their algorithmics courses, e.g. INF421)

→ define a distance δ between clusters and: - merge the closest clusters
                                            - their distance is the new height
Agglomerative hierarchical clustering

Distances δ based on ground metric d:

• minimum distance (single-linkage):
  δSL (C, C ′ ) := min_{p∈C, p′∈C ′} d(p, p′ )

• maximum distance (complete-linkage):
  δCL (C, C ′ ) := max_{p∈C, p′∈C ′} d(p, p′ )

• mean distance (average-linkage):
  δAL (C, C ′ ) := (1 / (|C| |C ′ |)) Σ_{p∈C} Σ_{p′∈C ′} d(p, p′ )
Impact of choice of distance δ

Real dataset (Fisher’s Iris data):

▶ n = 150 observations

▶ d = 4 variables:
sepal length/width,
petal length/width

▶ k = 3 species:
virginica, versicolor, setosa

http://archive.ics.uci.edu/ml/datasets/Iris

Impact of choice of distance δ

▶ single-linkage:

+ efficient implementation (Kruskal’s algo.)


+ stability guarantees against (small) noise
- sensitive to outliers
- unbalanced trees / chaining effect
Impact of choice of distance δ

▶ complete-linkage:

+ balanced tree
- sensitive to outliers
- no theoretical guarantees

Impact of choice of distance δ

▶ average-linkage:

+ tradeoff balancedness/stability
- no theoretical guarantees
Distances based on statistical quantities

Distance δ based on (weighted) intra-cluster variance — Ward's criterion:
(this involves a statistical quantity that connects AHC to k-means — same objective function: sum of squared distances to the mean)

δ(C, C ′ ) := |C ∪ C ′ | Var(C ∪ C ′ ) − ( |C| Var(C) + |C ′ | Var(C ′ ) )
           = ( |C| |C ′ | / (|C| + |C ′ |) ) ∥EC − EC ′ ∥₂²   in Euclidean space

▶ the next merge is the one that least increases the (weighted) intra-cluster variance
  (weighted intra-cluster variance is nothing but the sum of squared distances to the mean, as in k-means)

▶ δ(C, C ′ ) is non-negative and monotonously increasing
  (non-negativity is easily seen from the formula above, involving the distance between cluster centroids; monotonicity implies that the dendrogram has no reversals)

▶ greedy approach ⇒ the intra-cluster variance is not minimal for a fixed #clusters
  → the output clusters can be optimized using k-means (same objective function)

Distances based on statistical quantities

▶ Ward’s criterion on iris data:


Distances based on statistical quantities

Distance δ based on unweighted intra-cluster variance:
(roughly, we divide the Ward distance between two clusters by the size of the total, i.e. joint, cluster)

δ(C, C ′ ) := Var(C ∪ C ′ ) − ( |C| Var(C) + |C ′ | Var(C ′ ) ) / |C ∪ C ′ |
           = ( |C| |C ′ | / (|C| + |C ′ |)² ) ∥EC − EC ′ ∥₂²   in Euclidean space

▶ using the unweighted variance may lead to reversals in the output dendrogram

(example: three points p1 , p2 , p3 forming an equilateral triangle of side length 1;
 δ({pi }, {pi }) = 0, δ({p1 }, {p2 }) = 1/4, but δ({p1 , p2 }, {p3 }) = 1/6 < 1/4: a reversal)

Distances based on statistical quantities

▶ unweighted intra-cluster variance criterion on iris data:


Outline

• Principles of hierarchical clustering

• Agglomerative hierarchical clustering

• Single-linkage clustering

• Connection to ultrametrics and stability

• Hierarchical clustering and phylogenetic trees

Single-linkage clustering

δ(C, C ′ ) := min_{p∈C, p′∈C ′} d(p, p′ )

(note: δ increases from 0 to ∞ during the agglomerative process; thus, from now on we abuse the notation and see δ also as a real quantity that increases continuously)

Def: given s ≥ 0 and pi ∈ P , call Cs (pi ) the cluster containing pi when δ = s

(note: s ≤ t ⇒ Cs (pi ) ⊆ Ct (pi ) — merges only)


Single-linkage clustering

δ(C, C ′ ) := min_{p∈C, p′∈C ′} d(p, p′ )

(note: δ increases from 0 to ∞ during the agglomerative process; we see δ also as a real quantity that increases continuously)

Prop: (invariant) For any pi ∈ P and any s ≥ 0,

Cs (pi ) ≡ connected component (CC) of the s-neighborhood graph
(P, {(pk , pl ) | d(pk , pl ) ≤ s}) that contains pi

(in other words, the clusters are given by the CCs of the neighborhood graph)

proof (by induction on s):

• true for s = 0 (δ = 0 ⇒ singleton clusters and CCs)

• if true for δ = s, then still true for δ = t for any t > s such that the interval (s, t] contains no pairwise distance d(pk , pl )
  (the potential breaks of the invariant can only occur at edge lengths)

• true for s = d(pk , pl ): by the definitions, the edge (pk , pl ) changes the CCs of the neighborhood graph iff it connects two different CCs, iff d(pk , pl ) is cluster-connecting   □

Single-linkage clustering

δ(C, C ′ ) := min_{p∈C, p′∈C ′} d(p, p′ )

Algo.: (Kruskal — minimum spanning tree ≡ set of connecting edges)

1. sort the edges of the complete graph on P by increasing length

2. iterate over the edges in this order:

▶ if the next edge connects 2 points from different clusters C, C ′ ,
  then it triggers the next merge C, C ′ 7→ C ∪ C ′
▶ else the edge can be ignored

Details / complexity:

- sort edges in O(n² log n) time using merge sort
  (pre-sorting is what allows the algorithm to be fast, compared to complete or average linkage)

- iterate over the edges in O(n² α(n)) time using a disjoint-set forest
  (α = inverse Ackermann function; the Ackermann function is defined recursively by a double induction, with value n + 1 if m = 0)
Example

(figure: the s-neighborhood graph ≡ intersection graph of balls of radius s/2, shown for increasing δ = s, together with the resulting dendrogram over scales δ = 0, 2, 4, · · · , 16)

height of the least common ancestor (LCA) of pi , pj = smallest s for which Cs (pi ) = Cs (pj )
(the equality holds by definition of the dendrogram; by our previous invariant, it is equivalent to the CC characterization)
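A C++ sketch of this Kruskal-style single-linkage procedure (illustrative; the flat distance-matrix layout, names, and returned merge heights are assumptions, not the course's code).

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Disjoint-set forest with path compression and union by size.
    struct UnionFind {
        std::vector<std::size_t> parent, size;
        explicit UnionFind(std::size_t n) : parent(n), size(n, 1) {
            std::iota(parent.begin(), parent.end(), 0);
        }
        std::size_t find(std::size_t x) {
            while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
            return x;
        }
        bool unite(std::size_t a, std::size_t b) {
            a = find(a); b = find(b);
            if (a == b) return false;                 // already in the same cluster: ignore edge
            if (size[a] < size[b]) std::swap(a, b);
            parent[b] = a; size[a] += size[b];
            return true;
        }
    };

    // Single-linkage clustering: returns the n-1 merge heights (dendrogram node heights),
    // given the row-major n x n distance matrix D.
    std::vector<double> single_linkage(const std::vector<double>& D, std::size_t n) {
        struct Edge { double len; std::size_t i, j; };
        std::vector<Edge> edges;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i + 1; j < n; ++j)
                edges.push_back({D[i * n + j], i, j});
        std::sort(edges.begin(), edges.end(),
                  [](const Edge& a, const Edge& b) { return a.len < b.len; });

        UnionFind uf(n);
        std::vector<double> heights;
        for (const Edge& e : edges)
            if (uf.unite(e.i, e.j)) {                 // edge connects two clusters: merge
                heights.push_back(e.len);
                if (heights.size() == n - 1) break;   // a single cluster remains
            }
        return heights;
    }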
Outline

• Principles of hierarchical clustering

• Agglomerative hierarchical clustering

• Single-linkage clustering

• Connection to ultrametrics and stability

• Hierarchical clustering and phylogenetic trees

Connection to ultrametrics and stability


(this connection makes it possible to prove formal properties of single-linkage clustering)

Def: Given θ a dendrogram, dθLCA (pi , pj ) := height of the LCA of pi , pj in θ

Prop: dθLCA is an ultrametric on P
(true for any dendrogram θ, not only that of single-linkage)

note: d : P × P → R is an ultrametric if:

• non-negativity: d ≥ 0

• symmetry: d(pi , pj ) = d(pj , pi )

• identity: d(pi , pj ) = 0 =⇒ pi = pj

• ultrametric inequality: d(pi , pk ) ≤ max{d(pi , pj ), d(pj , pk )}


Connection to ultrametrics and stability
(this connection makes it possible to prove formal properties of single-linkage clustering)

Def: Given θ a dendrogram, dθLCA (pi , pj ) := height of LCA in θ

Prop: dθLCA is an ultrametric on P

proof:

• non-negativity: height of LCA is ≥ 0

• symmetry: LCA(pi , pj ) = LCA(pj , pi ) =⇒ dθLCA (pi , pj ) = dθLCA (pj , pi )

• identity: dθLCA (pi , pj ) = 0 =⇒ pi = LCA(pi , pj ) = pj

• ultrametric inequality: dθLCA (pi , pk ) ≤ max{dθLCA (pi , pj ), dθLCA (pj , pk )}

(height of the lowest path pi → pk ≤ height of the path pi → pj → pk in the tree)   □

Connection to ultrametrics and stability

(this connection makes it possible to prove formal properties of single-linkage clustering)

Def: Given θ a dendrogram, dθLCA (pi , pj ) := height of the LCA of pi , pj in θ

Prop: dθLCA is an ultrametric on P

Thm: The map Ψ : θ 7→ dθLCA is a bijection between the dendrograms on P
and the ultrametrics on P . Its inverse Ψ−1 sends any ultrametric d
to the dendrogram θ^d_SL given by single-linkage clustering on (P, d).

proof: see INF631 (lecture 3)
https://www.enseignement.polytechnique.fr/informatique/INF631/

Thm: [Carlsson, Mémoli 2010]
(this result shows that dLCA is stable under small perturbations of the ground metric, i.e. of the input data)

For any metrics d, d′ : P × P → R (possibly not ultrametrics),

  max_{P ×P} | d_LCA^{θ^d_SL} − d_LCA^{θ^{d′}_SL} | ≤ max_{P ×P} | d − d′ |

(note: by the previous theorem, the inequality is an equality when d, d′ are ultrametrics)

proof: see INF631 (lecture 3)
https://www.enseignement.polytechnique.fr/informatique/INF631/

→ the height of the LCA is stable... but not the LCA itself
(figure: two nearby points pi , pj on the real line R whose LCA changes under a small perturbation, while its height changes little)
Connection to ultrametrics and stability


(this connection makes it possible to prove formal properties of single-linkage clustering)

Recall the chaining effect: a few outliers distort the metric
=⇒ clusters get merged far earlier than expected

(single-linkage is stable under small perturbations, but the max in the inequality of the last theorem means a few outliers can have this effect in the worst case)
Outline

• Principles of hierarchical clustering

• Agglomerative hierarchical clustering

• Single-linkage clustering

• Connection to ultrametrics and stability

• Hierarchical clustering and phylogenetic trees

Hierarchical clustering and phylogenetic trees


A (rooted) phylogenetic tree shows the evolutionary relationships among various biological species, based upon some measure of (dis-)similarity between the species.

(figure: example of a phylogenetic tree, for a subset of the plantigrade species)


▶ Goal: given a collection of species together with a distance/dissimilarity
measure d, build a tree whose LCA distance is as close to d as possible.
Hierarchical clustering and phylogenetic trees

Common approach: build a dendrogram θ from d


▶ approximate d by an ultrametric dθ
▶ all species lie at the same level / evolve at the same rate (molecular clock)

Hierarchical clustering and phylogenetic trees

UPGMA algorithm:
(UPGMA stands for "unweighted pair group method with arithmetic mean")

Input: set P of species, distance d : P × P → R given as an n × n matrix M

Process:
• run average-linkage clustering on M ⇝ dendrogram θ^d_AL

• scale θ^d_AL by a factor of 1/2 ⇝ dendrogram θ′^d_AL :

▶ each leaf is assigned height 0

▶ each internal node (corresponding to a merge C ∪ C ′ ) is assigned height (1/2) δAL (C, C ′ )

Output: θ′^d_AL

Thm: If d is an ultrametric then θ^d_AL = θ^d_SL .
Therefore d = d_LCA^{θ^d_SL} = d_LCA^{θ^d_AL} = 2 d_LCA^{θ′^d_AL} .

(the proof of the equality θ^d_AL = θ^d_SL is by induction, left as an exercise to the students — true at t = 0, where both start from the singleton partition; the 'therefore' comes from the aforementioned theorem on the canonical bijection between dendrograms and ultrametrics)
What you should know

• Concepts: dendrograms, divisive vs. agglomerative

• Meta-algorithm for agglomerative hierarchical clustering

• Geometric and statistical distances, and their impact

• Single-linkage clustering: algorithm, invariant, complexity

• Bijection dendrograms ←→ ultrametrics

• Stability result

• Optional: UPGMA algorithm to build phylogenetic trees


Density Estimation

Outline:

• Mathematical formulation

• Quality criteria for density estimators

• Parametric vs. non-parametric estimators

• Parametric estimators:
- Gaussian model
- Gaussian mixture models (GMMs)

• Non-parametric estimators:
- histograms
- kernel density estimators
Outline

• Mathematical formulation

• Quality criteria for density estimators

• Parametric vs. non-parametric estimators

• Parametric estimators:
- Gaussian model
- Gaussian mixture models (GMMs)

• Non-parametric estimators:
- histograms
- kernel density estimators

Mathematical formulation

Input: Pn = {p1 , · · · , pn } ⊂ Rd
Prior: pi ∼iid ν, for some unknown probability measure ν with density f : Rd → R
(hypothesis: existence of a probability measure with a density underlying the data)

Goal: estimate f from Pn , i.e. build an estimator fˆn : Rd → R

Input: Pn ⊂ R2 Prior: ∃ν w/ density f Output: fˆn


Mathematical formulation

Input: Pn = {p1 , · · · , pn } ⊂ Rd

Prior: pi ∼iid ν, for some unknown probability measure ν with density f : Rd → R
(hypothesis: existence of a probability measure with a density underlying the data)

Goal: estimate f from Pn , i.e. build an estimator fˆn : Rd → R

Sample applications:

• noise filtering

• outlier detection/removal

• clustering (DBSCAN, mean-shift, etc.)
Outline

• Mathematical formulation

• Quality criteria for density estimators

• Parametric vs. non-parametric estimators

• Parametric estimators:
- Gaussian model
- Gaussian mixture models (GMMs)

• Non-parametric estimators:
- histograms
- kernel density estimators

Quality of a density estimator

Note: For a fixed x ∈ Rd , the estimate fˆn (x) is itself a random variable in R

→ its mean/expectation EPn [fˆn (x)] and variance VarPn [fˆn (x)] are taken with respect to the drawing
  of the points p1 , · · · , pn : the estimator oscillates around its mean

(figure: the true density f in blue, the mean EPn [fˆn ] of the estimator in solid red, and the band
 EPn [fˆn ] ± sqrt(VarPn [fˆn ]) around it)

Bias: BiasPn (fˆn (x)) := EPn [fˆn (x)] − f (x)

→ fˆn (x) is biased if BiasPn (fˆn (x)) ̸= 0
→ fˆn (x) is unbiased if BiasPn (fˆn (x)) = 0
→ fˆn (x) is asymptotically unbiased if lim_{n→∞} BiasPn (fˆn (x)) = 0

Mean Squared Error (MSE): MSEPn (fˆn (x)) := EPn [ (fˆn (x) − f (x))² ]
(the mean squared, or L2, error over the set of all possible samplings of the points)

→ bias-variance decomposition (allows to measure the bias-variance tradeoff):

Thm: MSEPn (fˆn (x)) = BiasPn (fˆn (x))² + VarPn (fˆn (x))

▶ Proof: decompose fˆn (x) − f (x) = (fˆn (x) − EPn [fˆn (x)]) + (EPn [fˆn (x)] − f (x)), then

  EPn [ (fˆn (x) − f (x))² ] = EPn [ (fˆn (x) − EPn [fˆn (x)])² ] + (EPn [fˆn (x)] − f (x))²
                               + 2 (EPn [fˆn (x)] − f (x)) · EPn [ fˆn (x) − EPn [fˆn (x)] ]
                             = VarPn (fˆn (x)) + BiasPn (fˆn (x))² + 0.  □

Convergence (consistency): plim_{n→∞} fˆn (x) = f (x), i.e. convergence in probability:

  ∀ε > 0, lim_{n→∞} PPn ( |fˆn (x) − f (x)| > ε ) = 0

(asking that lim_{n→∞} PPn (fˆn (x) = f (x)) = 1 would be too strong a requirement)

Prop: MSEPn (fˆn (x)) → 0 as n → ∞ =⇒ fˆn (x) is consistent

▶ Proof: let ε > 0 and define Xn = fˆn (x) − f (x).
  E[Xn²] → 0 =⇒ Var Xn + (E Xn)² → 0 =⇒ Var Xn → 0 and |E Xn| → 0.
  Take N ∈ N s.t. ∀n > N , |E Xn| ≤ ε/2. Then, by Chebyshev's inequality,
  ∀n > N , P(|Xn| ≥ ε) ≤ P(|Xn − E Xn| ≥ ε/2) ≤ (4/ε²) Var Xn → 0.  □
(more generally, convergence in L2 implies convergence in probability, see e.g. Lemma 2.2.2 of
 Durrett's Probability: Theory and Examples, 4th edition)

Robustness: (sloppy def.)

expectation, variance, bias, MSE of the estimator are not perturbed too much
by adding outliers (e.g. consistency and asymptotic unbiasedness preserved)
Outline

• Mathematical formulation

• Quality criteria for density estimators

• Parametric vs. non-parametric estimators

• Parametric estimators:
- Gaussian model
- Gaussian mixture models (GMMs)

• Non-parametric estimators:
- histograms
- kernel density estimators

Parametric vs. non-parametric estimation

Parametric estimator: assumes that ν belongs to a known parametrized family

→ approach: compute estimate(s) of parameter(s) then take corresponding fˆn

→ examples
• Normal/Gaussian, Poisson, exponential families, etc.
• mixture models

Nonparametric estimator: no underlying parametrized family assumed

→ approach: estimate f pointwise directly

→ examples
• histograms
• kernel density estimators
• k-NN estimator
Outline

• Mathematical formulation

• Quality criteria for density estimators

• Parametric vs. non-parametric estimators

• Parametric estimators:
- Gaussian model
- Gaussian mixture models (GMMs)

• Non-parametric estimators:
- histograms
- kernel density estimators

parametric
Gaussian model
e.g. Yen-Chi Chen’s course notes for STAT 425, Lecture 6.

Univariate case (d = 1): p1 , · · · , pn ∼ ν = N(µ, σ²) (iid)

→ approach: compute estimates µ̂n and σ̂n² of the mean and variance, then define:

  fˆn (x) := (1/√(2π σ̂n²)) exp( −(x − µ̂n)² / (2 σ̂n²) )

Estimators:
• empirical mean: µ̂n := (1/n) Σ_{i=1}^n pi → unbiased, consistent (law of large numbers)
• empirical variance: σ̂n² := (1/n) Σ_{i=1}^n (pi − µ̂n)² → negatively biased, consistent
  (a calculation shows that the expected empirical variance is always smaller than σ²; Cochran's theorem)
• corrected empirical variance: σ̂n² := (1/(n−1)) Σ_{i=1}^n (pi − µ̂n)² → unbiased, consistent

(the resulting fˆn is then consistent by the continuous mapping theorem)
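A minimal C++ sketch of the plug-in estimator above (our own illustration, not part of the course material): fit the empirical mean and corrected variance, then evaluate the Gaussian density.

#include <cmath>
#include <vector>

struct GaussianFit { double mu, sigma2; };

// empirical mean and corrected (unbiased) empirical variance
GaussianFit fit_gaussian(const std::vector<double>& p) {
    const double n = static_cast<double>(p.size());
    double mu = 0.0;
    for (double v : p) mu += v;
    mu /= n;
    double s2 = 0.0;
    for (double v : p) s2 += (v - mu) * (v - mu);
    s2 /= (n - 1.0);                    // divide by n instead for the biased empirical variance
    return {mu, s2};
}

// plug-in density estimate fn(x) = N(mu_hat, sigma_hat^2) evaluated at x
double density(const GaussianFit& g, double x) {
    const double pi = 3.141592653589793;
    const double d = x - g.mu;
    return std::exp(-d * d / (2.0 * g.sigma2)) / std::sqrt(2.0 * pi * g.sigma2);
}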
parametric
Gaussian model
e.g. Yen-Chi Chen’s course notes for STAT 425, Lecture 6.

Univariate case (d = 1): p1 , · · · , pn ∼ ν = N(µ, σ²) (iid)

→ approach: compute estimates µ̂n and σ̂n² of the mean and variance, then define:

  fˆn (x) := (1/√(2π σ̂n²)) exp( −(x − µ̂n)² / (2 σ̂n²) )

Rates of convergence:
• empirical mean: |µ̂n − µ| = OPn (1/√n)   (Berry-Esseen theorem)
  (note: OPn denotes stochastic boundedness: ∀ε > 0 ∃∆, N > 0 s.t. PPn (√n |µ̂n − µ| > ∆) < ε for all n > N ;
   the bound ∆ may depend on ε and be arbitrarily large)
• empirical variance (corrected or not): |σ̂n² − σ²| = OPn (1/√n)
• density estimator: |fˆn (x) − f (x)| = OPn (1/√n)   (follows from the above bounds via some calculation)

parametric
Gaussian model
e.g. Yen-Chi Chen’s course notes for STAT 425, Lecture 6.

Univariate case (d = 1), same plug-in estimator:

  fˆn (x) := (1/√(2π σ̂n²)) exp( −(x − µ̂n)² / (2 σ̂n²) )

Non-Gaussian case: (ν has mean µ and variance σ² but ν ̸= N(µ, σ²))

• fˆn (x) still converges to f̄(x) := (1/√(2π σ²)) exp( −(x − µ)² / (2σ²) ) at rate 1/√n
  (because the empirical mean and variance still converge to the true mean and variance)

• |f̄(x) − f (x)| ̸= 0 (constant bias, i.e. independent of the sampling, so it cannot be reduced by more data)
parametric
Gaussian model
Multivariate case (d > 1): p1 , · · · , pn ∼ ν = Nd (µ, Σ) (iid), where Σ is a d × d covariance matrix

→ hypothesis: the covariance matrix Σ is non-singular

→ density function: f (x) = (1/√((2π)^d det Σ)) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

parametric
Gaussian model
Multivariate case (d > 1): p1 , · · · , pn ∼ ν = Nd (µ, Σ) (iid), where Σ is a d × d covariance matrix

→ approach: compute estimates µ̂n and Σ̂n of µ and Σ, then define:
  (the approach is the exact counterpart of the 1-d case)

  fˆn (x) := (1/√((2π)^d det Σ̂n)) exp( −(1/2) (x − µ̂n)ᵀ Σ̂n⁻¹ (x − µ̂n) )

Estimators:
• empirical mean: µ̂n := (1/n) Σ_{i=1}^n pi → unbiased, consistent
  (the d-dimensional empirical mean is the vector of coordinate-wise 1-dimensional empirical means)
• empirical covariance: Σ̂n := (1/n) Σ_{i=1}^n (pi − µ̂n)(pi − µ̂n)ᵀ → biased, consistent
• corrected empirical covariance: Σ̂n := (1/(n−1)) Σ_{i=1}^n (pi − µ̂n)(pi − µ̂n)ᵀ → unbiased, consistent

(the resulting fˆn is consistent by the continuous mapping theorem)
parametric
Gaussian model
Multivariate case (d > 1): p1 , · · · , pn ∼ ν = Nd (µ, Σ) (iid), same plug-in estimator:

  fˆn (x) := (1/√((2π)^d det Σ̂n)) exp( −(1/2) (x − µ̂n)ᵀ Σ̂n⁻¹ (x − µ̂n) )

Rates of convergence:
• empirical mean: ∥µ̂n − µ∥2 = OPn (√(d/n))   (the mean is defined coordinate-wise, but a different argument applies here)
• empirical covariance (corrected or not): ∥Σ̂n − Σ∥ = OPn (√(d/n))
  (∥·∥ is the operator norm: ∥M∥ = sup_{v̸=0} ∥Mv∥/∥v∥)
• density estimator: |fˆn (x) − f (x)| = OPn (√(d/n))
  (same calculation as in 1-d: basically the exact same analysis holds, up to the influence of the dimension)
parametric
Gaussian model
Multivariate case (d > 1), same plug-in estimator:

  fˆn (x) := (1/√((2π)^d det Σ̂n)) exp( −(1/2) (x − µ̂n)ᵀ Σ̂n⁻¹ (x − µ̂n) )

Non-Gaussian case: (ν has mean µ and covariance matrix Σ but ν ̸= Nd (µ, Σ))

• fˆn (x) still converges to f̄(x) := (1/√((2π)^d det Σ)) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ) at rate √(d/n)

• |f̄(x) − f (x)| ̸= 0 (constant bias)


Outline

• Mathematical formulation

• Quality criteria for density estimators

• Parametric vs. non-parametric estimators

• Parametric estimators:
- Gaussian model
- Gaussian mixture models (GMMs)

• Non-parametric estimators:
- histograms
- kernel density estimators

parametric
Gaussian mixture models (GMMs)

Aim: remove bias |f¯(x) − f (x)| when f is not Gaussian

Target density: mixture density f̄(x) = Σ_{l=1}^r ξl ϕl (x), where:
(this so-called 'mixture' density is by definition a convex combination of densities)

▶ each ϕl is a probability density function (pdf)
▶ each weight ξl is ≥ 0
▶ the weights ξl sum up to 1: Σ_{l=1}^r ξl = 1

Our setup: Gaussian mixture: each ϕl is the pdf of some Nd (µl , Σl ):

  ϕl (x) = (1/√((2π)^d det Σl)) exp( −(1/2) (x − µl)ᵀ Σl⁻¹ (x − µl) ) =: Φµl ,Σl (x)

Underlying generative model: p1 , · · · , pn ∼ Σ_{l=1}^r ξl Nd (µl , Σl ) (iid)   (GMM)
parametric
Gaussian mixture models (GMMs)

Aim: remove bias |f¯(x) − f (x)| when f is not Gaussian

Target density: mixture density f̄(x) = Σ_{l=1}^r ξl ϕl (x)   (as above)

Our setup: Gaussian mixture: each ϕl is the pdf of some Nd (µl , Σl ).

Thm: [Bacharoglou 2010]
For every continuous, compactly supported pdf f on R, there is a sequence
(f̄n )n≥1 of Gaussian mixtures (of finite length rn → ∞) s.t. ∥f̄n − f ∥∞ → 0.
(this implies pointwise convergence of the bias |f̄n (x) − f (x)| to zero as n goes to infinity)

parametric
Gaussian mixture models (GMMs)

Approach: for a fixed r, estimate the weights ξl and parameters µl , Σl , then define:
(the estimator aims at approximating f̄)

  fˆn (x) := Σ_{l=1}^r ξ̂l Φµ̂l ,Σ̂l (x)

Maximum likelihood estimation (MLE):
(we work under the hypothesis that the data indeed come from a mixture model)

  (ξ̂l , µ̂l , Σ̂l )_{l=1}^r := argmax over (ξl , µl , Σl )_{l=1}^r of  Σ_{i=1}^n log ( Σ_{l=1}^r ξl Φµl ,Σl (pi ) )

  (the objective is the log-likelihood of (pi )_{i=1}^n given (ξl , µl , Σl )_{l=1}^r )

▶ likelihood of the parameters (ξl , µl , Σl )_{l=1}^r := probability of observing the sample Pn given
  the choice of parameters, under p1 , · · · , pn ∼ Σ_{l=1}^r ξl Nd (µl , Σl ) (iid):

  L(pi ; (ξl , µl , Σl )_{l=1}^r ) = Σ_{l=1}^r ξl Φµl ,Σl (pi )   (mixture density)

  L((pi )_{i=1}^n ; (ξl , µl , Σl )_{l=1}^r ) = Π_{i=1}^n L(pi ; (ξl , µl , Σl )_{l=1}^r )   (independence)
parametric
Gaussian mixture models (GMMs)

Approach: for a fixed r, estimate the weights ξl and parameters µl , Σl , then define:

  fˆn (x) := Σ_{l=1}^r ξ̂l Φµ̂l ,Σ̂l (x)

Maximum likelihood estimation (MLE):

  (ξ̂l , µ̂l , Σ̂l )_{l=1}^r := argmax over (ξl , µl , Σl )_{l=1}^r of  Σ_{i=1}^n log ( Σ_{l=1}^r ξl Φµl ,Σl (pi ) )

▶ good asymptotic behavior w.r.t. f̄: consistency, asymptotic unbiasedness, achieves the Cramér–Rao lower bound
▶ no closed-form solution
▶ variational solvers (gradient ascent, Expectation-Maximization)
  (these solvers quickly become costly — e.g. EM with 50 Gaussians is the Holy Grail — and/or get stuck)
▶ non-concave functional ⇒ local maxima, non-unique global maximum
▶ choice of the mixture size r: large bias (small r) vs. large variance (large r)
  (indeed, the larger r, the larger the number of parameters in the model and estimator, hence the larger the variance)
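The slides only name Expectation-Maximization; as a reminder (the standard EM iteration for GMMs, recalled here rather than taken from the course material), each iteration alternates between responsibilities and parameter updates:

  E-step:  γ_{i,l} = ξl Φµl ,Σl (pi ) / Σ_{m=1}^r ξm Φµm ,Σm (pi )

  M-step:  ξl = (1/n) Σ_{i=1}^n γ_{i,l} ,    µl = Σ_{i=1}^n γ_{i,l} pi / Σ_{i=1}^n γ_{i,l} ,
           Σl = Σ_{i=1}^n γ_{i,l} (pi − µl)(pi − µl)ᵀ / Σ_{i=1}^n γ_{i,l}

Each iteration does not decrease the log-likelihood, which is why EM converges to a local maximum.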
Outline

• Mathematical formulation

• Quality criteria for density estimators

• Parametric vs. non-parametric estimators

• Parametric estimators:
- Gaussian model
- Gaussian mixture models (GMMs)

• Non-parametric estimators:
- histograms
- kernel density estimators

nonparametric
Histograms

Principle: Build a tessellation of Rd (grid, Voronoi diagram, etc.),


then record the number of observations in each cell

(image source: http://www.wikiwand.com/en/Multivariate_kernel_density_estimation)


nonparametric
Histograms

Principle: Build a tessellation of Rd (grid, Voronoi diagram, etc.),


then record the number of observations in each cell

Uniform grid:

• assume wlog that Pn ⊂ [0, 1)d

• tessellate [0, 1)d with uniform grid of size N d

• For any x ∈ [0, 1)d , let fˆn (x) := #observations in Cell(x) / ( n Vol(Cell(x)) )
                                    = (N^d / n) Σ_{i=1}^n 1_{pi ∈ Cell(x)}

  (in other words, up to the factor Vol(Cell(x)) = N^{−d}, it is the fraction of the empirical mass that lies in Cell(x))

(figure: uniform N × N grid on [0, 1)², with the cell Cell(x) containing the query point x highlighted)
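A minimal C++ sketch of this estimator (our own illustration), assuming the points have already been rescaled to lie in [0,1)^d:

#include <cmath>
#include <vector>

double histogram_density(const std::vector<std::vector<double>>& P,  // n points in [0,1)^d
                         const std::vector<double>& x,               // query point in [0,1)^d
                         int N)                                      // grid resolution per axis
{
    const int d = static_cast<int>(x.size());
    // index of the grid cell containing a value along one coordinate
    auto cell_index = [N](double t) { return static_cast<int>(t * N); };

    std::size_t count = 0;
    for (const auto& p : P) {
        bool same_cell = true;
        for (int j = 0; j < d; ++j)
            if (cell_index(p[j]) != cell_index(x[j])) { same_cell = false; break; }
        if (same_cell) ++count;
    }
    // f_n(x) = (N^d / n) * #{p_i in Cell(x)}
    return std::pow(static_cast<double>(N), d) * count / static_cast<double>(P.size());
}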
nonparametric
Histograms
(hypothesis: f Lipschitz-continuous)

Bias:

  EPn [fˆn (x)] = (N^d / n) Σ_{i=1}^n P(pi ∈ Cell(x)) = N^d P(p1 ∈ Cell(x))   (linearity of expectation)
               = N^d ∫_{Cell(x)} f (u) du ∈ [f (y), f (z)] for some y, z ∈ Cell(x)
                 (f continuous reaches its minimum and maximum on Cell(x), which is compact)
               = f (x∗) for some x∗ ∈ Cell(x)
                 (intermediate value theorem along a path, e.g. the straight-line segment, between y and z; Cell(x) is path-connected)

  |BiasPn (fˆn (x))| = |EPn [fˆn (x)] − f (x)| = |f (x∗) − f (x)| ≤ Lip_f √d / N
  (x and x∗ belong to the same cell, whose diameter is √d / N)

Note: the bias decreases with N
nonparametric
Histograms
(hypothesis: f Lipschitz-continuous)

Variance:

  VarPn (fˆn (x)) = (N^{2d} / n²) Σ_{i=1}^n Var(1_{pi ∈ Cell(x)}) = (N^{2d} / n) Var(1_{p1 ∈ Cell(x)})
  (in general the variance is not the sum of the variances — covariance terms appear — but here the terms are independent)
                 = (N^{2d} / n) ( P(p1 ∈ Cell(x)) − P(p1 ∈ Cell(x))² )
                   (definition of the variance of a discrete univariate random variable)
                 = (N^d / n) ( f (x∗) − f (x∗)² / N^d )

  Note: the variance increases with N   (f (x∗) depends only on the true density f , not on the grid step)

  MSEPn (fˆn (x)) ≤ Lip_f² d / N² + N^d f (x∗) / n ,    N_OPT = ( 2n Lip_f² / f (x∗) )^{1/(d+2)}
  (the optimal N annihilates the derivative of the right-hand side)

  MSE_OPT (fˆn (x)) = O( n^{−2/(d+2)} )

nonparametric
Histograms

▶ N_OPT is unknown in practice (must be inferred)

▶ the grid does not adapt to the shape of the support of ν
  (the tessellation is chosen a priori, without knowledge of the shape of the support of the density)

▶ the tessellation can become costly to maintain as d increases
  (it actually depends on what you do with the density: mere pointwise evaluation, as above, remains cheap)

  MSEPn (fˆn (x)) ≤ Lip_f² d / N² + N^d f (x∗) / n ,    N_OPT = ( 2n Lip_f² / f (x∗) )^{1/(d+2)} ,
  MSE_OPT (fˆn (x)) = O( n^{−2/(d+2)} )
Outline

• Mathematical formulation

• Quality criteria for density estimators

• Parametric vs. non-parametric estimators

• Parametric estimators:
- Gaussian model
- Gaussian mixture models (GMMs)

• Non-parametric estimators:
- histograms
- kernel density estimators

nonparametric
Kernel density estimators
(kernel-based estimators have been designed to adapt naturally to the shape of the support of the density)

Principle: make fˆn a mixture of copies of an 'elementary' density (kernel),
anchored at each observation

(image source: http://www.wikiwand.com/en/Multivariate_kernel_density_estimation)

General formula: (convolution)
(fˆn is a convolution of the empirical measure by the density KH )

  fˆn (x) := (1/n) Σ_{i=1}^n KH (x − pi ),   where KH (u) := (det H)^{−1/2} K(H^{−1/2} u)

• H: inner-product (positive-definite) d × d matrix (adds scaling / anisotropy)
  (for instance, in the previous illustration, K is an isotropic d-variate Gaussian function)

• K : Rd → R+ : d-variate kernel:
  ∫_{Rd} K(u) du = 1 (normalized)              ∫_{Rd} u K(u) du = 0 (centered at origin)
  lim_{∥u∥→∞} K(u) = 0 (vanishes at infinity)   ∫_{Rd} u uᵀ K(u) du = cK Id (isotropic)
  (the covariance matrix associated with K is a multiple of the identity)
nonparametric
Kernel density estimators
(kernel-based estimators adapt naturally to the shape of the support of the density)

Specialization 1: take H = σ² Id (isotropic kernel); σ is called the bandwidth / window

Specialization 2: take K(u) ∝ k(∥u∥2²) for some k : R+ → R+ (radially-symmetric kernel); k is the kernel profile

  normalizing factor: c_{k,d} := ( ∫_{Rd} k(∥u∥2²) du )^{−1}, so that ∫_{Rd} K(u) du = ∫_{Rd} c_{k,d} k(∥u∥2²) du = 1

  ⇝ fˆn (x) := ( c_{k,d} / (n σ^d) ) Σ_{i=1}^n k( ∥x − pi∥2² / σ² )
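A minimal C++ sketch (our own illustration) of the radially-symmetric Gaussian case of this formula:

#include <cmath>
#include <vector>

// fn(x) = 1/(n (2 pi)^(d/2) sigma^d) * sum_i exp(-||x - p_i||^2 / (2 sigma^2))
double gaussian_kde(const std::vector<std::vector<double>>& P,  // the n observations
                    const std::vector<double>& x,               // query point
                    double sigma)                               // bandwidth
{
    const double pi = 3.141592653589793;
    const std::size_t d = x.size();
    double sum = 0.0;
    for (const auto& p : P) {
        double sq = 0.0;
        for (std::size_t j = 0; j < d; ++j) sq += (x[j] - p[j]) * (x[j] - p[j]);
        sum += std::exp(-sq / (2.0 * sigma * sigma));
    }
    const double norm = P.size() * std::pow(2.0 * pi, d / 2.0)
                      * std::pow(sigma, static_cast<double>(d));
    return sum / norm;
}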

nonparametric
Common kernels
Flat / Uniform: kU (t) := { 1 if t ≤ 1, 0 if t > 1 }  ⇝  c_{k,d} = 1 / Vol Bd (0, 1) = Γ(d/2 + 1) / π^{d/2}
  (note that the volume of the unit ball in 1-d is 2)
  (plots: profile kU and kernel KU)

Epanechnikov: kE (t) := { 1 − t if t ≤ 1, 0 if t > 1 }  ⇝  c_{k,d} = (d + 2) / (2 Vol Bd (0, 1))
  (plots: profile kE and kernel KE)

Gaussian: kN (t) := exp(−t/2)  ⇝  c_{k,d} = (2π)^{−d/2}
  (plots: profile kN and kernel KN)
nonparametric
Common kernels

Old faithful geyser dataset (available in R):


- 1st coordinate: waiting time (sec.) between eruptions

- 2nd coordinate (unused): eruption duration (sec.)

(figure: histogram of the waiting times (number of occurrences), with the density estimates obtained
 for the Gaussian, Uniform, and Epanechnikov kernels)
nonparametric
Influence of the bandwidth

• small σ (undersmoothing): small bias (sensitivity), large variance (instability)

• large σ (oversmoothing): large bias (insensitivity), small variance (stability)

(figure: kernel density estimates of the Old Faithful geyser waiting times for σ = 1, σ = 3, and σ = 10)

nonparametric
Convergence rates

Radially-symmetric Gaussian kernel in Rd :
(we detail the Gaussian kernel because its behavior is typical of kernel density estimators)

  fˆn (x) := ( 1 / ((2π)^{d/2} n σ^d) ) Σ_{i=1}^n exp( −∥x − pi∥2² / (2σ²) )

Bias: EPn [fˆn (x)] − f (x) = O(σ²)   (decreases as σ → 0)

Variance: VarPn (fˆn (x)) = O( 1 / (n σ^d) )   (increases as σ → 0)

Mean squared error: MSEPn (fˆn (x)) = O( σ⁴ + 1 / (n σ^d) )

  σ_OPT = n^{−1/(d+4)}   =⇒   MSE_OPT (fˆn (x)) = O( n^{−4/(d+4)} )
Summary

  Method           convergence rate    parameter(s)       limitation(s)

parametric:
  Gaussian model   O(√(d/n))           N/A                bias for non-Gaussian densities
  GMMs             O(√(d/n))           mixture size r     local maxima, computation cost

nonparametric:
  Histogram        O(n^{−1/(d+2)})     number N of bins   curse of dimensionality, computation cost
  Kernel density   O(n^{−2/(d+4)})     bandwidth σ        curse of dimensionality
What you should know

• Concepts: density estimator, parametric vs. non-parametric,


MSE and bias-variance decomposition, convergence

• Gaussian model: definition, fitting, convergence rates

• Gaussian mixture models: definition, fitting by MLE, convergence rates

• Histograms: definition, bias-variance vs. grid size, convergence rates

• Kernel density estimators: definition, common kernels,


bias-variance vs. bandwidth, convergence rates
Supervised Learning
and k-NN Prediction

Outline:

• Foundations of supervised learning

• Regression with squared error and k-NN regression

• Classification with 0-1 loss and k-NN classification

• Advantages and drawbacks of k-NN predictors in practice

• Cross-validation

• Evaluating a classifier’s performance


Outline

• Foundations of supervised learning

• Regression with squared error and k-NN regression

• Classification with 0-1 loss and k-NN classification

• Advantages and drawbacks of k-NN predictors in practice

• Cross-validation

• Evaluating a classifier’s performance

Supervised learning
Input: n observations + responses (x1 , y1 ), · · · , (xn , yn ) ∈ X × Y

Goal: build a predictor f : X → Y from (x1 , y1 ), · · · , (xn , yn )


whose mean prediction error on new query observations is minimal

regression: Y continuous    (e.g. X = R, Y = R)
classification: Y discrete  (e.g. X = {images}, Y = {labels})
Statistical framework
Hyp:  xi ∼ X (iid) with values in X = Rd ,   yi ∼ Y (iid) with values in Y = R (regression) or Y = {1, · · · , κ} (classification)
(the observations are drawn from a random variable X, the responses from a random variable Y)

→ X, Y are the marginals of the joint distribution P(X, Y ), which encodes the complexity of the problem
  (in fact the complexity is encoded in the conditional distribution P(Y | X))

X, Y perfectly dependent ⇒ ∃ perfect predictor

X, Y imperfectly dependent ⇒ ∄ perfect predictor

Statistical framework
Prediction error is measured by a loss function L : Y × Y → R

→ goal: minimize risk (expected prediction error): E(X,Y ) L(Y, f (X))


→ in practice: minimize the empirical risk: (1/n) Σ_{i=1}^n L(yi , f (xi ))
  (this is the mean error on the input dataset)

Popular loss functions for regression (Y = R):

  L(yi , f (xi ))                                                              Name
  (yi − f (xi ))²                                                              squared error (MSE)
  |yi − f (xi )|                                                               absolute error (MAE)
  (1/2)(yi − f (xi ))² if |yi − f (xi )| < δ,  δ|yi − f (xi )| − δ²/2 otherwise   Huber loss

(each loss function has its own advantages and drawbacks, e.g. the squared error is differentiable and easy to optimize)

(figure: plots of the squared, absolute, and Huber losses)
Statistical framework
Prediction error is measured by a loss function L : Y × Y → R

→ goal: minimize risk (expected prediction error): E(X,Y ) L(Y, f (X))


→ in practice: minimize the empirical risk: (1/n) Σ_{i=1}^n L(yi , f (xi ))

Popular loss functions for binary classification (Y = {−1, 1}):

  L(yi , f (xi ))              Name          Classifier
  1_{yi ̸= f (xi )}            zero-one      Bayes / k-NN
  (1 − yi f (xi ))²            square loss   (no name)
  max{0, 1 − yi f (xi )}       hinge         SVM
  exp(−yi f (xi ))             exponential   Boosting
  log(1 + exp(−yi f (xi )))    logistic      Logistic reg.

(the square-loss classifier has no particular name because it is hardly ever used; Multi-Layer Perceptrons,
 which combine multiple logistic regressions, also optimize the logistic loss)

(figure: the losses plotted as functions of the margin yi f (xi ) — guess who is who?)
Outline

• Foundations of supervised learning

• Regression with squared error and k-NN regression

• Classification with 0-1 loss and k-NN classification

• Advantages and drawbacks of k-NN predictors in practice

• Cross-validation

• Evaluating a classifier’s performance

Regression with squared error


Hyp:  xi ∼ X (iid) with values in X = Rd ,   yi ∼ Y (iid) with values in Y = R (regression)

Risk: R(f ) = E(X,Y ) [ (Y − f (X))² ]
           = EX [ E(Y |X) [ (Y − f (X))² | X ] ]   (conditioning on X: decompose the joint probability P(X, Y ) = P(Y | X) P(X))

→ minimize the risk pointwise (i.e. independently for each value x of X):
  (the values of f can be set pointwise independently — no regularity assumption on f for now)

  f ∗ (x) := argmin_{y∈Y} E(Y |X) [ (Y − y)² | X = x ]
  (y is our guess for f (x), and we take the one that minimizes the expected error conditioned on X = x)

→ minimizer: f ∗ (x) = E(Y |X) [Y | X = x]   (regression function)
  (this is where the choice of the squared loss comes into play: the minimizer of the expected squared error
   is the pointwise mean of Y along the vertical slice X = x)

(best prediction of Y at point X = x is the conditional mean;
 note that we have no control over the regularity of f ∗ : it is prescribed by the distribution)
k-NN regression
Hyp:  xi ∼ X (iid) with values in X = Rd ,   yi ∼ Y (iid) with values in Y = R (regression)
(unknown probability distributions)

  f ∗ (x) = E(Y |X) [Y | X = x]

→ the expectation is estimated by averaging the responses over the k-NNs of x
  (conditioning on the k-NNs, since P(∃ sample exactly at x) = 0)

  fˆn,k (x) := (1/k) Σ_{xi ∈ NNk (x)} yi   (regression estimator)
  (variant: responses weighted by inverse distances to x; more generally, one can choose any set of non-negative weights)

Thm: (universal consistency) [Stone 1977] [Devroye 1982]
Suppose Y is a bounded random variable. Then the estimator fˆn,k is consistent
if and only if the choice of k = k(n) satisfies k → ∞ and k/n → 0 as n → ∞.
(bibliographic note: Stone proved sufficiency, Devroye refined the result and proved necessity;
 'bounded' means that there is some constant η ≥ 0 such that P(|Y | > η) = 0;
 also, the result holds in any ambient dimension d, which is not involved in the bounds)

Note: fˆn,k is consistent if: ∀x ∈ X , plim_{n→∞} fˆn,k (x) = f ∗ (x)
(i.e. the predictor converges in probability to the best theoretical predictor f ∗ , the regression function)
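A minimal C++ sketch of k-NN regression by linear scan (our own illustration, assuming 1 ≤ k ≤ n):

#include <algorithm>
#include <utility>
#include <vector>

double knn_regress(const std::vector<std::vector<double>>& X,  // n observations in R^d
                   const std::vector<double>& y,               // n responses
                   const std::vector<double>& x,               // query point
                   std::size_t k)
{
    const std::size_t n = X.size();
    std::vector<std::pair<double, std::size_t>> dist(n);       // (squared distance, index)
    for (std::size_t i = 0; i < n; ++i) {
        double sq = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j)
            sq += (X[i][j] - x[j]) * (X[i][j] - x[j]);
        dist[i] = {sq, i};
    }
    // keep the k nearest neighbors in the first k slots (partial selection, O(n))
    std::nth_element(dist.begin(), dist.begin() + k - 1, dist.end());
    // average the responses of the k nearest neighbors
    double sum = 0.0;
    for (std::size_t i = 0; i < k; ++i) sum += y[dist[i].second];
    return sum / static_cast<double>(k);
}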
Outline

• Foundations of supervised learning

• Regression with squared error and k-NN regression

• Classification with 0-1 loss and k-NN classification

• Advantages and drawbacks of k-NN predictors in practice

• Cross-validation

• Evaluating a classifier’s performance

Classification with 0-1 loss


iid
xi ∼ X with values in X = Rd
Hyp: iid
yi ∼ Y with values in Y = {1, · · · , κ} (classification)

Risk: R(f ) = E(X,Y ) 1Y ̸=f (X)


 
= EX E(Y |X) 1Y ̸=f (X) | X (conditioning on X)

→ minimize risk pointwise (i.e. independently for each value x of X):



  f ∗ (x) := argmin_{y∈{1,··· ,κ}} E(Y |X) [ 1_{Y ̸= y} | X = x ]
  (again, y is our guess for f (x), and we take the one that minimizes the expected conditional loss)

           = argmin_{y∈{1,··· ,κ}} Σ_{r=1}^κ 1_{r ̸= y} P(Y = r | X = x)   (Y categorical variable; the sum over r is just the expectation written out)

           = argmin_{y∈{1,··· ,κ}} 1 − P(Y = y | X = x) = argmax_{y∈{1,··· ,κ}} P(Y = y | X = x)
  (this last step is where the choice of the 0-1 loss comes into play)

(best prediction at x maximizes the posterior probability P(Y | X) (Bayes classifier))
⇒ Bayes error rate R(f ∗ ) is zero when X, Y are perfectly dependent

k-NN classification
Hyp:  xi ∼ X (iid) with values in X = Rd ,   yi ∼ Y (iid) with values in Y = {1, · · · , κ}
(unknown probability distributions)

  f ∗ (x) = argmax_{y∈{1,··· ,κ}} P(Y = y | X = x)

→ the argmax is determined by a majority vote among the k-NNs of x
  (conditioning on the k-NNs, since P(∃ sample exactly at x) = 0)

  fˆn,k (x) := argmax_{y∈{1,··· ,κ}} #{i : yi = y and xi ∈ NNk (x)}

Thm: (1-NN optimality) [Cover, Hart 1967]
For classification into κ classes:

  0 ≤ R(f ∗ ) ≤ lim_{n→∞} R(fˆn,1 ) ≤ R(f ∗ ) ( 2 − (κ/(κ−1)) R(f ∗ ) ) ≤ (κ−1)/κ

(the upper bound (κ−1)/κ is that of the random classifier: given a new observation x, choose its label uniformly at random)

⇒ the 1-NN estimator is consistent when X, Y are perfectly dependent or independent
  (X, Y perfectly dependent ⇒ R(f ∗ ) = 0 = lim_n R(fˆn,1 ))
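A short C++ sketch of the majority vote (our own illustration), given the labels of the k nearest neighbors of the query (found e.g. as in the regression sketch above); labels are assumed to lie in {0, ..., kappa−1}:

#include <algorithm>
#include <vector>

int majority_vote(const std::vector<int>& neighbor_labels, int kappa) {
    std::vector<int> count(kappa, 0);
    for (int y : neighbor_labels) ++count[y];      // tally the votes of the k neighbors
    // argmax over the kappa classes
    return static_cast<int>(std::max_element(count.begin(), count.end()) - count.begin());
}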
Outline

• Foundations of supervised learning

• Regression with squared error and k-NN regression

• Classification with 0-1 loss and k-NN classification

• Advantages and drawbacks of k-NN predictors in practice

• Cross-validation

• Evaluating a classifier’s performance

Advantages and drawbacks of k-NN in practice


• high flexibility:

▶ little prior on the fitting model


▶ method based on distances / (dis-)similarities (no need for coordinates)

• easiness of implementation:

▶ only a few lines of code for NN-search via linear scan

• extends naturally to other problems:

▶ density estimation:

  fˆn,k (x) := (k/n) · 1 / ( Vd ∥x − NNk (x)∥2^d )

  (the denominator of the second fraction is the volume of the ball whose radius is the distance from x
   to its k-th nearest neighbor; volume of the unit ball: Vd = π^{d/2} / Γ(d/2 + 1))
Advantages and drawbacks of k-NN in practice
• algorithmic cost of prediction:
  (by contrast, pre-training based methods — e.g. linear methods — put most of the algorithmic cost on training)

  ▶ linear scan in Θ(nd)
  ▶ sublinear methods become (close to) linear in high dimensions

• slow convergence in high dimensions (curse of dimensionality):
  (the difference between the k-NN classifier's error rate and the (optimal) Bayes error rate converges slowly)

  ▶ asymptotic regime often not attained in practice ⇝ need to select k carefully:
    small k leads to overfitting, large k leads to underfitting


Outline

• Foundations of supervised learning

• Regression with squared error and k-NN regression

• Classification with 0-1 loss and k-NN classification

• Advantages and drawbacks of k-NN predictors in practice

• Cross-validation

• Evaluating a classifier’s performance

Cross-validation
(transition: one solution to cope with the issue of selecting k is cross-validation, which is not bound
 to the k-NN classifier; in fact it is a general method to do hyperparameter selection in supervised learning)

A general method to select hyperparameters in supervised learning algorithms

Principle:
▶ explore the hyperparameter space or a subset thereof (e.g. via sampling)
  (for k-NN we threshold at an upper limit value of k; we can also subsample the range of values)
▶ for each choice of hyperparameter value(s):
  - train the classifier with these values
    (for k-NN the training phase is trivial, but the testing phase is costly)
  - test its performance empirically
▶ keep the value(s) that yield the best performance

In practice:
▶ partition the initial dataset into two subsets: T (training) and V (validation/test)
  (this is a very important aspect: T and V must be disjoint, because it is the prediction power that is measured)
▶ do the training on T , then test hyperparameter values on V
  (when a predictor is evaluated experimentally, yet another part of the data must be kept aside)
▶ average its performance over some subset of all partitions T ⊔ V
  (averaging is important, to make the result independent of the choice of partition)
Cross-validation
Examples of methods:

Exhaustive cross-validation: ('exhaustive' means that a subspace of partitions is entirely explored)

▶ leave-one-out: 1 observation is reserved for validation ⇝ n partitions

▶ leave-p-out: p observations are reserved for validation ⇝ ( n choose p ) partitions

Non-exhaustive cross-validation:

▶ r-fold: randomly partition the observations into r subsets of size n/r
  (reserve 1 subset for validation and use the rest for training, i.e. apply leave-one-out on the r subsets;
   most commonly r = 10; see the sketch after this list)

▶ holdout: use a single random partition T ⊔ V (each pt assigned independently)
  (risk of large bias here, due to the use of a single partition)

▶ Monte-Carlo: repeatedly use random partitions T ⊔ V
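A minimal C++ sketch of r-fold cross-validation (our own illustration). The callable train_and_score is a placeholder: it is expected to train a predictor with hyperparameter k on the training indices and return its empirical score on the validation indices.

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

template <typename ScoreFn>
double r_fold_cv(std::size_t n, std::size_t r, int k, ScoreFn train_and_score) {
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), std::mt19937{42});   // random partition into r folds

    double total = 0.0;
    for (std::size_t fold = 0; fold < r; ++fold) {
        std::vector<std::size_t> train, valid;
        for (std::size_t i = 0; i < n; ++i)
            (i % r == fold ? valid : train).push_back(idx[i]);
        total += train_and_score(train, valid, k);            // score of this fold
    }
    return total / static_cast<double>(r);                    // average over the r folds
}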
Outline

• Foundations of supervised learning

• Regression with squared error and k-NN regression

• Classification with 0-1 loss and k-NN classification

• Advantages and drawbacks of k-NN predictors in practice

• Cross-validation

• Evaluating a classifier’s performance

Evaluating a classifier’s performance


Given a test set V = {x′1 , · · · , x′m } and known responses {y′1 , · · · , y′m }:

• error rate: τerr := (1/m) #{misclassified points}

• accuracy: τacc := 1 − τerr = (1/m) #{correctly classified points}
  ('taux de succès' — and not 'précision' — in French)
  ▶ biased when the classes have significantly different sizes (e.g. 1/10^6 for sick vs. healthy):
    in this kind of situation the performance measure privileges the most common class

• confusion matrix: Ci,j := (1/m) #{points of class j predicted as being in i}
  (each column represents a true class, each row represents a prediction; this representation is the richest)
  (figure: κ × κ matrix, rows = predictions, columns = classes, colors (blue → red) giving the fraction of each class)

  ▶ true positives (TP) for class i: points of this class correctly predicted in i
  ▶ false positives (FP) for class i: points of other classes incorrectly predicted in i
  ▶ true negatives (TN) for class i: points of j ̸= i predicted in l ̸= i (possibly with l ̸= j)
  ▶ false negatives (FN) for class i: points of i predicted in some j ̸= i
  (these measures are asymmetric: they depend on the class considered)

  ▶ precision / positive predictive value: PPV := TP / (TP + FP)
    (measures the reliability of positive predictions: fraction of true positives among the positive predictions)
  ▶ recall / sensitivity / true positive rate: TPR := TP / (TP + FN)
    (measures the ability to capture the positive instances)
  ▶ fall-out / false positive rate: FPR := FP / (TN + FP)
    (measures the tendency to predict negatives as positives)
  (defined for binary classification, or one-vs.-all in the multi-class case)
Evaluating a classifier’s performance
Given a test set V = {x′1 , · · · , x′m } and known responses {y′1 , · · · , y′m }:

• F-score: FS := 2 / (1/PPV + 1/TPR) = 2 · PPV · TPR / (PPV + TPR)   (harmonic mean of precision & recall)
  ▶ biased towards positives
    (privileges classifiers with both high precision and high recall, i.e. reliable positive predictions and ability to capture positives)

• receiver operating characteristic (ROC) curve:
  (for a classifier that estimates the posterior probability of class +1 and thresholds it to choose the label)
  ▶ plots the classifier's recall (TPR) versus its fall-out (FPR)
    (if the curve falls under the diagonal, the classifier becomes worse than the random classifier)
  ▶ perfect classifier: FP = FN = 0, hence TPR = 1 and FPR = 0
  ▶ random classifier: TPR = FPR (it classifies each point independently as +1 or −1 according to a random variable)
  ▶ AUC: area under the ROC curve (AUC ≤ 1)

(figure: ROC curves of the perfect classifier, a typical classifier, and the random classifier in the (FPR, TPR) unit square)
0 FPR 1
What you should know

• Concepts: classification, regression, loss function, risk, empirical risk

• Regression function, k-NN regressor and its consistency

• Bayes classifier, k-NN classifier and its optimality

• Advantages and drawbacks of k-NN predictors

• Cross-validation: principle, main methods (exhaustive, non-exhaustive)

• Evaluation criteria: accuracy, confusion matrix, precision/recall/fall-out,


F-score, ROC curve
Linear Models for Regression

Outline:

• Reminder about supervised regression

• Linear model for regression and Ordinary Least Squares estimator

• Optimality and practical evaluation

• Degenerate settings and regularization

• Parametric non-linear regression using basis functions

• Non-parametric non-linear regression using kernels


Outline

• Reminder about supervised regression

• Linear model for regression and Ordinary Least Squares estimator

• Optimality and practical evaluation

• Degenerate settings and regularization

• Parametric non-linear regression using basis functions

• Non-parametric non-linear regression using kernels

Supervised learning (regression)


Input: n observations + responses (x1 , y1 ), · · · , (xn , yn ) ∈ X × Y

Goal: build a predictor f : X → Y from (x1 , y1 ), · · · , (xn , yn )


whose mean prediction error on new query observations is minimal

regression: Y continuous    (e.g. X = R, Y = R)
Statistical framework
Hyp:  xi ∼ X (iid) with values in X = Rd ,   yi ∼ Y (iid) with values in Y = R (regression)
(the observations are drawn from a random variable X, the responses from a random variable Y)

→ X, Y are the marginals of the joint distribution P(X, Y ), which encodes the complexity of the problem
  (in fact the complexity is encoded in the conditional distribution P(Y | X))

X, Y perfectly dependent ⇒ ∃ perfect predictor

X, Y imperfectly dependent ⇒ ∄ perfect predictor

Statistical framework
Hyp:  xi ∼ X (iid) with values in X = Rd ,   yi ∼ Y (iid) with values in Y = R (regression)
(X, Y are the marginals of the joint distribution P(X, Y ), which encodes the complexity of the problem)

Prediction error is measured by a loss function L : Y × Y → R

→ goal: minimize the risk (expected prediction error): E(X,Y ) [ L(Y, f (X)) ]

→ in practice: minimize the empirical risk: (1/n) Σ_{i=1}^n L(yi , f (xi ))   (the mean error on the input dataset)

(figure: plots of the MSE, MAE, and Huber losses; reminder: MSE = Mean Squared Error, MAE = Mean Absolute Error)
Regression with squared error
Hyp:  xi ∼ X (iid) with values in X = Rd ,   yi ∼ Y (iid) with values in Y = R (regression)

Risk: R(f ) = E(X,Y ) [ (Y − f (X))² ]
           = EX [ E(Y |X) [ (Y − f (X))² | X ] ]   (conditioning on X: P(X, Y ) = P(Y | X) P(X))

→ minimize the risk pointwise (i.e. independently for each value x of X):

  f ∗ (x) := argmin_{y∈Y} E(Y |X) [ (Y − y)² | X = x ]

→ minimizer: f ∗ (x) = E(Y |X) [Y | X = x]   (regression function)

(best prediction of Y at point X = x is the conditional mean)

Advantages and drawbacks of k-NN in practice


• high flexibility:

▶ little prior on the fitting model   [by contrast, the linear model assumes a strong prior: linearity]

▶ method based on distances / (dis-)similarities (no need for coordinates)
  [the linear model requires a linear space, hence coordinates]

• easiness of implementation:

▶ only a few lines of code for NN-search via linear scan
  [the linear model offers easy and efficient prediction (a dot-product)]

• algorithmic cost of prediction:   [for the linear model, the algorithmic cost is put on training]
▶ linear scan in Θ(nd)
▶ sublinear methods become (close to) linear in high dimensions

• slow convergence in high dimensions (curse of dimensionality):   [the linear model has no hyper-parameter]

▶ asymptotic regime often not attained in practice ⇝ need to select k


Outline

• Reminder about supervised regression

• Linear model for regression and Ordinary Least Squares estimator

• Optimality and practical evaluation

• Degenerate settings and regularization

• Parametric non-linear regression using basis functions

• Non-parametric non-linear regression using kernels

Linear model for regression


Hyp: Y depends linearly on X plus some independent noise ε:
(the noise variable ε is assumed to be independent from the variable X; this is therefore a parametric
 approach: we assume the underlying predictor belongs to the family of affine functions)

  Y = β0 + Σ_{j=1}^d Xj βj + ε

  β0 : intercept;   ε: independent, centered, real random variable (noise), i.e. E[ε] = 0

(figure: Y ∈ R plotted against X ∈ Rd , with the regression line crossing the vertical axis at β0 )
Linear model for regression
Hyp: Y depends linearly on X plus some independent noise ε:

  Y = β0 + Σ_{j=1}^d Xj βj + ε = [ 1 Xᵀ ] β + ε ,   where β = (β0 , β1 , · · · , βd )ᵀ ∈ R^{d+1}

  β: the parameter to be estimated from the data: β̂

Linear predictor: fβ̂ (x) := [ 1 xᵀ ] β̂
(note that we do not need an estimate ε̂ in our predictor, since the noise is centered)

(figure: data, true line with intercept β0 , and fitted line with intercept β̂0 )

Ordinary least squares (OLS) estimator


Input: (x1 , y1 ), · · · , (xn , yn ) ∈ Rd × R

−→ estimate β by minimizing the empirical risk with MSE:
(note again that we only need to estimate β in order to define our linear predictor)

  β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − fβ (xi ))² = argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xiᵀ ] β)²

(choose the β whose corresponding linear predictor fβ minimizes the MSE, or equivalently the residual sum of squares (RSS))


Ordinary least squares (OLS) estimator
Input: (x1 , y1 ), · · · , (xn , yn ) ∈ Rd × R

−→ estimate β by minimizing the empirical risk with MSE:

  β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xiᵀ ] β)²   (equivalently, minimize the residual sum of squares)

  RSS(β) = ∥y − X β∥2² ,   where y = (y1 , · · · , yn )ᵀ ∈ Rn is the response vector and
  X = [ [ 1 x1ᵀ ] ; · · · ; [ 1 xnᵀ ] ] ∈ R^{n×(d+1)} is the coordinates matrix with 1's appended

  (RSS is a quadratic function of β)

Ordinary least squares (OLS) estimator


Input: (x1 , y1 ), · · · , (xn , yn ) ∈ Rd × R

−→ estimate β by minimizing the empirical risk with MSE:

  β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xiᵀ ] β)²   (equivalently, minimize RSS(β))

  RSS(β) = ∥y − X β∥2² ,  with y ∈ Rn the response vector and X ∈ R^{n×(d+1)} the coordinates matrix with 1's appended

  ∇β RSS(β) = −2 Xᵀ (y − X β)   (gradient vector of RSS at β)
  ∇β² RSS(β) = 2 Xᵀ X           (Hessian matrix of RSS at β)

⇒ 2 Xᵀ X is positive semi-definite ⇒ the functional is convex (but possibly not strictly convex)
⇒ the minimizers are its critical points, i.e. they satisfy Xᵀ (y − X β) = 0
Ordinary least squares (OLS) estimator
Nondegeneracy assumption: matrix X has full column rank

⇒ 2 Xᵀ X is positive definite
⇒ β̂ = (Xᵀ X)⁻¹ Xᵀ y is the unique minimizer

▶ predictor: fβ̂ (x) = [ 1 xᵀ ] β̂ = [ 1 xᵀ ] (Xᵀ X)⁻¹ Xᵀ y

(recall RSS(β) = ∥y − X β∥2² , ∇β RSS(β) = −2 Xᵀ (y − X β), ∇β² RSS(β) = 2 Xᵀ X)

Ordinary least squares (OLS) estimator


Nondegeneracy assumption: matrix X has full column rank

⇒ 2 Xᵀ X is positive definite
⇒ β̂ = (Xᵀ X)⁻¹ Xᵀ y is the unique minimizer

▶ predictor: fβ̂ (x) = [ 1 xᵀ ] β̂ = [ 1 xᵀ ] (Xᵀ X)⁻¹ Xᵀ y

▶ fitted values:

  ∀i, ŷi := fβ̂ (xi ) = [ 1 xiᵀ ] β̂

  ŷ := X β̂ = X (Xᵀ X)⁻¹ Xᵀ y   (orthogonal projection of y onto col X, the column space of X)

(writing v0 , · · · , vd ∈ Rn for the columns of X, with v0 ≡ 1, the subspace ⟨v0 , · · · , vd ⟩ ⊆ Rn is the one
 spanned by the input variables, and ŷ is the orthogonal projection of y onto it)
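A minimal sketch of the OLS estimator with the Eigen library mentioned later in this chapter (our own illustration, not the course's reference code):

#include <Eigen/Dense>

// X_raw: n x d matrix of observations, y: n-vector of responses.
// Returns beta_hat in R^{d+1} (intercept first), assuming X has full column rank.
Eigen::VectorXd ols(const Eigen::MatrixXd& X_raw, const Eigen::VectorXd& y) {
    const Eigen::Index n = X_raw.rows(), d = X_raw.cols();
    Eigen::MatrixXd X(n, d + 1);                 // coordinates matrix with 1's appended
    X.col(0) = Eigen::VectorXd::Ones(n);
    X.rightCols(d) = X_raw;
    // Solves the least-squares problem min ||X beta - y|| (same minimizer as the normal
    // equations X^T X beta = X^T y), via a QR factorization rather than an explicit inverse.
    return X.colPivHouseholderQr().solve(y);
}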
Outline

• Reminder about supervised regression

• Linear model for regression and Ordinary Least Squares estimator

• Optimality and practical evaluation

• Degenerate settings and regularization

• Parametric non-linear regression using basis functions

• Non-parametric non-linear regression using kernels

Optimality
Assumptions:
  (x1 , y1 ), · · · , (xn , yn ) ∼ (X, Y ) (iid), taking values in Rd × R (regression)
  Y = [ 1 Xᵀ ] β + ε (linear model)
  E[ε] = 0 and Var(ε) < +∞ (centered noise with finite variance)

Then: (Gauss-Markov Theorem)

the OLS estimator β̂ is unbiased: E(X,y) [β̂] = β
(the expectation is taken over all possible samplings (x1 , y1 ), · · · , (xn , yn ), and computed component-wise)

the OLS estimator minimizes the MSE among all unbiased linear estimators:

  β̂ ∈ argmin_{β̃} E(X,y) ∥β̃ − β∥2²

(recalling the bias-variance decomposition theorem from Lecture 5, the connection between MSE and variance
 comes from the assumption that the estimators are unbiased)

⇒ β̂ is a best linear unbiased estimator (BLUE):
  Var(β̃) − Var(β̂) is positive semi-definite for any unbiased linear estimator β̃
(note that, since β̃ and β̂ are vector-valued random variables, their variances are actually covariance matrices)
Evaluation in practice
Let ȳ := (1/n) Σ_{i=1}^n yi be the empirical mean response, and ŷ := X β̂ the predictions

• total sum of squares: TSS := Σ_{i=1}^n (yi − ȳ)² = ∥y − ȳ 1∥2²
  (up to a factor of n, TSS is the empirical variance of Y ; here 1 denotes the all-ones vector of Rn )
• explained sum of squares: ESS := Σ_{i=1}^n (ŷi − ȳ)² = ∥ŷ − ȳ 1∥2²
  (interpreted, up to a factor of n, as the variance 'explained' by the model around the mean)
• residual sum of squares: RSS := Σ_{i=1}^n (ŷi − yi )² = ∥ŷ − y∥2²
  (already encountered when minimizing the MSE; note that the observations and responses here are those
   of the training set, on which the performance is measured)

Prop: TSS = ESS + RSS

(figure: right triangle in Rn with sides √TSS = ∥y − ȳ 1∥, √ESS = ∥ŷ − ȳ 1∥ and √RSS)

▶ fraction of variance unexplained: FVU := RSS / TSS ∈ [0, 1]

▶ coefficient of determination: R² := ESS / TSS = 1 − FVU ∈ [0, 1]
  (the notation R² reflects that, in the least-squares model, it is the square of the sample correlation
   between the responses and the predictions)

  R close to 1 ⇒ good fit
  R close to 0 ⇒ bad fit

(generally speaking, R² is a statistic that gives information about the goodness of fit of the linear model)
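A short Eigen-based sketch of these quantities (our own illustration):

#include <Eigen/Dense>

double r_squared(const Eigen::VectorXd& y, const Eigen::VectorXd& y_hat) {
    const double y_bar = y.mean();
    const double tss = (y.array() - y_bar).matrix().squaredNorm();       // total sum of squares
    const double rss = (y_hat - y).squaredNorm();                        // residual sum of squares
    // For an OLS fit with intercept, 1 - RSS/TSS equals ESS/TSS (Prop: TSS = ESS + RSS).
    return 1.0 - rss / tss;
}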
Outline

• Reminder about supervised regression

• Linear model for regression and Ordinary Least Squares estimator

• Optimality and practical evaluation

• Degenerate settings and regularization

• Parametric non-linear regression using basis functions

• Non-parametric non-linear regression using kernels

Degenerate settings
Q: what if the coordinates matrix X does not have full column rank?
(i.e. the d + 1 columns are not linearly independent;
 happens typically with perfectly correlated variables or when n < d)

▶ the minimizers β̂ then form an affine subspace
▶ one choice of β̂ is more natural: the one within the subspace ⟨x1 , · · · , xn ⟩
  (a.k.a. the one with smallest norm)

(figure: the observations xi (circled dots) are located in a horizontal plane (d = 2), and several
 minimizers β̂ ∈ Rd are shown along the affine subspace of solutions)
Degenerate settings
Q: what if the coordinates matrix X does not have full column rank?
(happens typically with perfectly correlated variables or when n < d)

▶ solution 1: dimensionality reduction: (cf. Lecture 10)

• preprocessing: estimate the subspace ⟨x1 , · · · , xn ⟩

• main: solve optimization problem within the estimated ⟨x1 , · · · , xn ⟩

Degenerate settings
Q: what if the coordinates matrix X does not have full column rank?
(happens typically with perfectly correlated variables or when n < d)

▶ solution 2: regularized linear regression:
  (this alternative approach does not require a preprocessing step, as it tries to align β̂ with the data subspace)

  • ridge (ℓ2 penalty):   β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xiᵀ ] β)² + λ ∥β∥2²
    (easy to optimize; the distribution of coefficients over the input variables tends to have full support)

  • lasso (ℓ1 penalty):   β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xiᵀ ] β)² + λ ∥β∥1
    (harder to optimize, as the objective is non-differentiable; tends to produce a sparse distribution of coefficients)

  • elastic net:          β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xiᵀ ] β)² + λ ( α ∥β∥2² + (1 − α) ∥β∥1 )
    (trade-off between the previous two: for small norm values the ℓ1 term dominates)
Ridge regression
  β̂ := argmin_{β∈R^{d+1}} (1/n) Σ_{i=1}^n (yi − [ 1 xiᵀ ] β)² + λ ∥β∥2²
     = argmin_{β∈R^{d+1}} ∥y − X β∥2² + λ ∥β∥2²   (up to rescaling λ by the factor n)

  ∇ · (β) = −2 Xᵀ (y − X β) + 2λ β
  ∇² · (β) = 2 Xᵀ X + 2λ Id+1   → positive definite for any λ > 0
  (because the first term is positive semi-definite) ⇒ strictly convex functional

▶ β̂ = (Xᵀ X + λ Id+1 )⁻¹ Xᵀ y

▶ algorithms: LU decomposition, Cholesky decomposition
  (LU decomposition works on any invertible matrix; it decomposes the matrix into a product of lower and upper triangular factors)
  — O(n³) by Gaussian elimination, O(n^ω) by divide-and-conquer

▶ in practice: use library for linear algebra (e.g. LAPACK, Eigen)
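A minimal sketch using Eigen, one of the libraries named above (our own illustration): solve the regularized normal equations with a Cholesky-type factorization.

#include <Eigen/Dense>

// X: n x (d+1) design matrix (1's already appended), y: n-vector, lambda > 0.
Eigen::VectorXd ridge(const Eigen::MatrixXd& X, const Eigen::VectorXd& y, double lambda) {
    const Eigen::Index p = X.cols();
    // Normal equations of the regularized problem: (X^T X + lambda I) beta = X^T y.
    Eigen::MatrixXd A = X.transpose() * X
                      + lambda * Eigen::MatrixXd::Identity(p, p);
    // A is symmetric positive definite for lambda > 0, so a Cholesky-type (LDLT) solve applies.
    return A.ldlt().solve(X.transpose() * y);
}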


Outline

• Reminder about supervised regression

• Linear model for regression and Ordinary Least Squares estimator

• Optimality and practical evaluation

• Degenerate settings and regularization

• Parametric non-linear regression using basis functions

• Non-parametric non-linear regression using kernels

Non-linear regression using basis functions


(We remain in the realm of parametric methods: we assume the predictor belongs to some parametric family — polynomials, RBFs, etc.)

Q: what if X, Y are dependent through some non-linear function?

Example: Y = X² + X + 1
(in this example we apply a polynomial transformation)

▶ transform initial variables:
  X₁′ := X
  X₂′ := X²
  so that Y = X₂′ + X₁′ + 1
(figures: the curve y = x² + x + 1 in the original variables; in the transformed variables, the relation becomes the hyperplane Y = X₂′ + X₁′ + 1 and the data lie above the curve x₂′ = x₁′²)

▶ solve linear regression with transformed variables (see the sketch below):

  f_β̂(x) := [ 1 x′ᵀ ] β̂   where x′ = (x, x²) ∈ ℝ²

  β̂ := argmin_{β∈ℝ³} ∥y − X′ β∥₂² + λ ∥β∥₂²
(here X′ is the transformed version of X: its columns are the transformed variables; observe that the transformations do not involve β, so the objective function remains quadratic)
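As an illustration of the transformation step, here is a small C++/Eigen sketch that builds the transformed design matrix X′ for the 1-D polynomial example above and fits a linear model on it; the reuse of the `ridge` helper sketched earlier is an assumption made for this example.

#include <Eigen/Dense>

// Basis-function regression for the 1-D example: map x to x' = (x, x^2),
// then fit a linear model on the transformed variables.
Eigen::VectorXd fit_quadratic(const Eigen::VectorXd& x, const Eigen::VectorXd& y, double lambda) {
    const int n = x.size();
    Eigen::MatrixXd Xp(n, 3);                // columns: 1, x, x^2
    Xp.col(0) = Eigen::VectorXd::Ones(n);
    Xp.col(1) = x;
    Xp.col(2) = x.cwiseProduct(x);
    return ridge(Xp, y, lambda);             // beta in R^3
}

// Prediction with the fitted parameters: f(x) = [1 x x^2] * beta.
double predict_quadratic(double x, const Eigen::VectorXd& beta) {
    return beta(0) + beta(1) * x + beta(2) * x * x;
}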
Outline

• Reminder about supervised regression

• Linear model for regression and Ordinary Least Squares estimator

• Optimality and practical evaluation

• Degenerate settings and regularization

• Parametric non-linear regression using basis functions

• Non-parametric non-linear regression using kernels

Non-linear regression using kernels


(Transition: we now turn to a non-parametric approach. More precisely, we still re-embed the data.)

Def: A Hilbert function space H ⊂ ℝ^{ℝᵈ} (i.e. a subspace of functions ℝᵈ → ℝ) is a reproducing kernel Hilbert space (RKHS) on ℝᵈ if ∃Φ : ℝᵈ → H s.t.:
  ∀x ∈ ℝᵈ, ∀f ∈ H,  f(x) = ⟨f, Φ(x)⟩_H    (reproducing property)
(H contains the functions k_x = k(x, ·))

Terminology:
• feature space H, feature map Φ
• feature vectors Φ(x)
• kernel k := ⟨Φ(·), Φ(·)⟩_H : ℝᵈ × ℝᵈ → ℝ

(figure: Φ maps ℝᵈ into H; the kernel k is the pullback of the inner product of H)
Non-linear regression using kernels
(Transition: we now turn to a non-parametric approach. More precisely, we still re-embed the data.)

Def: A Hilbert function space H ⊂ ℝ^{ℝᵈ} (a subspace of functions ℝᵈ → ℝ) is a reproducing kernel Hilbert space (RKHS) on ℝᵈ if ∃Φ : ℝᵈ → H s.t.:
  ∀x ∈ ℝᵈ, ∀f ∈ H,  f(x) = ⟨f, Φ(x)⟩_H    (reproducing property; H contains the functions k_x = k(x, ·))

Prop: The kernel of any RKHS on ℝᵈ is unique.
Conversely, k is the kernel of at most one RKHS on Rd .

▶ Φ(x) = k(x, ·)

Non-linear regression using kernels


(Transition: we now turn to a non-parametric approach. More precisely, we still re-embed the data.)

Def: A Hilbert function space H ⊂ ℝ^{ℝᵈ} (a subspace of functions ℝᵈ → ℝ) is a reproducing kernel Hilbert space (RKHS) on ℝᵈ if ∃Φ : ℝᵈ → H s.t.:
  ∀x ∈ ℝᵈ, ∀f ∈ H,  f(x) = ⟨f, Φ(x)⟩_H    (reproducing property; H contains the functions k_x = k(x, ·))

Prop: The kernel of any RKHS on ℝᵈ is unique.
Conversely, k is the kernel of at most one RKHS on Rd .

Thm: [Moore 1950] k : Rd × Rd → R is a kernel iff it is positive (semi-)definite,


i.e. ∀n ∈ N, ∀x1 , · · · , xn ∈ Rd , the Gram matrix (k(xi , xj ))i,j is positive
semi-definite.
in other words, the Gram matrix (k(xi , xj ))i,j is positive semi-definite
Examples:
• linear: k(x, y) = ⟨x, y⟩ H = (Rd )∗ , Φ(x) = ⟨x, ·⟩
• polynomial: k(x, y) = (1 + ⟨x, y⟩)ᴺ = Σ_{n₀+n₁+···+n_d=N} ( N ; n₀, n₁, ···, n_d ) x₁^{n₁} ··· x_d^{n_d} · y₁^{n₁} ··· y_d^{n_d}
  (each term is, up to the multinomial constant, the product of a coordinate of Φ(x) with the matching coordinate of Φ(y))

• Gaussian: k(x, y) = exp( −∥x − y∥₂² / (2σ²) ),  σ > 0.    H ⊂ L²(ℝᵈ)
Non-linear regression using kernels
(Transition: we now turn to a non-parametric approach. More precisely, we still re-embed the data.)

Def: A Hilbert function space H ⊂ ℝ^{ℝᵈ} (a subspace of functions ℝᵈ → ℝ) is a reproducing kernel Hilbert space (RKHS) on ℝᵈ if ∃Φ : ℝᵈ → H s.t.:
  ∀x ∈ ℝᵈ, ∀f ∈ H,  f(x) = ⟨f, Φ(x)⟩_H    (reproducing property; H contains the functions k_x = k(x, ·))

Thm: (Representer) [Kimeldorf, Wahba 1971] [Schölkopf et al 2001]


Given RKHS H with kernel k, any function f ∗ ∈ H minimizing
Pn
1
n i=1 L(yi , f (xi )) + λ ∥f ∥2H
Pn
is of the form f ∗ (·) = j=1 αj k(xj , ·), where α1 , · · · , αn ∈ R.

(The underlying intuition is the same as for regularization in the degenerate linear case: among all minimizers, one can be chosen in the span of the data — here, the span of the functions k(x₁, ·), · · · , k(xₙ, ·).)

Non-linear regression using kernels


(Transition: we now turn to a non-parametric approach. More precisely, we still re-embed the data.)

Def: A Hilbert function space H ⊂ ℝ^{ℝᵈ} (a subspace of functions ℝᵈ → ℝ) is a reproducing kernel Hilbert space (RKHS) on ℝᵈ if ∃Φ : ℝᵈ → H s.t.:
  ∀x ∈ ℝᵈ, ∀f ∈ H,  f(x) = ⟨f, Φ(x)⟩_H    (reproducing property; H contains the functions k_x = k(x, ·))

Thm: (Representer) [Kimeldorf, Wahba 1971] [Schölkopf et al 2001]


Given RKHS H with kernel k, any function f ∗ ∈ H minimizing
Pn
1
n i=1 L(yi , f (xi )) + λ ∥f ∥2H
Pn
is of the form f ∗ (·) = j=1 αj k(xj , ·), where α1 , · · · , αn ∈ R.

▶ argmin_α  (1/n) Σᵢ₌₁ⁿ L( yᵢ, Σⱼ₌₁ⁿ αⱼ k(xⱼ, xᵢ) ) + λ Σᵢ,ⱼ₌₁ⁿ αᵢ αⱼ k(xᵢ, xⱼ)
(we have replaced ∥f∥_H by its square, which does not change anything since the regularizer can be any increasing function of the norm)

where α = [ α₁, ···, αₙ ]ᵀ — only the values k(xᵢ, xⱼ) are required to minimize (kernel trick)
Non-linear regression using kernels

Case of regression with squared error:

n n
!2 n
X X X
argmin yi − αj k(xj , xi ) + λ αi αj k(xi , xj )
α
i=1 j=1 i,j=1

= argmin ∥y − Kα∥2 + λ αT Kα where Kij := k(xi , xj )


α

▶ α̂ = (K + λ In )−1 y

▶ f̂(x) = Σⱼ₌₁ⁿ α̂ⱼ k(xⱼ, x)
(this expression can be interpreted as follows: instead of fixing the class of the estimator f̂ a priori, we let it be a weighted combination of kernel functions centered at the observations)
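A compact C++/Eigen sketch of these two formulas with a Gaussian kernel; the bandwidth `sigma` and the overall structure are assumptions made for this example, not part of the slides.

#include <Eigen/Dense>
#include <cmath>

double gauss_kernel(const Eigen::RowVectorXd& a, const Eigen::RowVectorXd& b, double sigma) {
    return std::exp(-(a - b).squaredNorm() / (2.0 * sigma * sigma));
}

// Kernel ridge regression: alpha = (K + lambda I)^{-1} y.
// X: n x d matrix of observations (one row per observation), y: n responses.
Eigen::VectorXd fit_alpha(const Eigen::MatrixXd& X, const Eigen::VectorXd& y,
                          double lambda, double sigma) {
    const int n = X.rows();
    Eigen::MatrixXd K(n, n);                        // Gram matrix K_ij = k(x_i, x_j)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            K(i, j) = gauss_kernel(X.row(i), X.row(j), sigma);
    Eigen::MatrixXd A = K + lambda * Eigen::MatrixXd::Identity(n, n);
    return A.ldlt().solve(y);                       // alpha_hat
}

// Prediction: f(x) = sum_j alpha_j k(x_j, x).
double predict(const Eigen::MatrixXd& X, const Eigen::VectorXd& alpha,
               const Eigen::RowVectorXd& x, double sigma) {
    double f = 0.0;
    for (int j = 0; j < X.rows(); ++j)
        f += alpha(j) * gauss_kernel(X.row(j), x, sigma);
    return f;
}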

Non-linear regression using kernels



Experimentation: y = sin((x − 49.5)/3)


(figure: fits of the data by linear regression and by kernel regression with a Gaussian kernel, for σ = 0.1, σ = 1, σ = 10)
(by left-right symmetry, the line produced by linear regression must be horizontal, i.e. the linear predictor is essentially constant; higher values of σ lead to higher bias)
Non-linear regression using kernels

Experimentation: y = sin((x − 49.5)/3) + ε where ε ∼ N (0, 0.1)


(figure: kernel regression fits with a Gaussian kernel, for σ = 0.1, σ = 1, σ = 2.5, σ = 10)
(this time, σ = 0.1 leads to overfitting — too high variance — due to the noise)
What you should know

• Linear regression model and OLS estimator

• Gauss-Markov theorem, TSS = ESS + RSS, FVU, R2

• Ridge regression, existence of Lasso regression

• Principles of non-linear regression:

- basis functions
- kernels: definition, Moore and Representer theorems (kernel trick)
Linear Models for Classification

Outline:
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels

Supervised learning (classification)


Input: n observations + responses (x1 , y1 ), · · · , (xn , yn ) ∈ X × Y

Goal: build a predictor f : X → Y from (x1 , y1 ), · · · , (xn , yn )


whose mean prediction error on new query observations is minimal

?
X = {images}
Y = {labels}

classification: Y discrete
Statistical framework
Hyp:  xᵢ ~ iid X with values in X = ℝᵈ
      yᵢ ~ iid Y with values in Y = {1, · · · , κ}   (classification)
(the observations are drawn from a random variable X, the responses from a random variable Y; X and Y are the marginals of the joint distribution Pr(X, Y), which encodes the complexity of the problem — in fact the complexity is encoded in the conditional distribution of Y given X)

(figure: X, Y perfectly dependent ⇒ ∃ perfect predictor;  X, Y imperfectly dependent ⇒ ∄ perfect predictor)

Statistical framework
Hyp:  xᵢ ~ iid X with values in X = ℝᵈ
      yᵢ ~ iid Y with values in Y = {1, · · · , κ}   (classification)
(the observations are drawn from a random variable X, the responses from a random variable Y; X and Y are the marginals of the joint distribution Pr(X, Y), which encodes the complexity of the problem)
Prediction error is measured by a loss function L : Y × Y → R

→ goal: minimize risk (expected prediction error): E(X,Y ) L(Y, f (X))


→ in practice: minimize empirical risk:  (1/n) Σᵢ₌₁ⁿ L(yᵢ, f(xᵢ))   (i.e. the mean error on the input dataset)
Classification with 0-1 loss
iid
xi ∼ X with values in X = Rd
Hyp: iid
yi ∼ Y with values in Y = {1, · · · , κ} (classification)

Risk: R(f ) = E(X,Y ) 1Y ̸=f (X)


 
= EX E(Y |X) 1Y ̸=f (X) | X (conditioning on X)

→ minimize risk pointwise (i.e. independently for each value x of X):



f(x) := argmin_{y∈{1,···,κ}} E_{(Y|X)} [ 1_{Y≠y} | X = x ]
(here y is our guess for f(x), and we take the y that minimizes the expected conditional loss)

     = argmin_{y∈{1,···,κ}} Σ_{r=1}^κ 1_{r≠y} P(Y = r | X = x)
(the sum over r is just the expression of the expectation — Y is a categorical variable)

     = argmin_{y∈{1,···,κ}} 1 − P(Y = y | X = x)  =  argmax_{y∈{1,···,κ}} P(Y = y | X = x)
(this last step is where the choice of the 0-1 loss comes into play)
(best prediction at x maximizes the posterior probability P(Y | X) (Bayes classifier))

Advantages and drawbacks of k-NN in practice


• high flexibility:
  ▶ little prior on the fitting model   (cf. linear methods: linear model prior)
  ▶ method based on distances / (dis-)similarities, no need for coordinates   (linear methods require a linear space, hence coordinates)

• easiness of implementation:
  ▶ only a few lines of code for NN-search via linear scan
    (the same applies to more advanced sublinear methods, using the right libraries — e.g. ANN, LSH)
  (cf. linear methods: easy and efficient prediction via a dot product)

• algorithmic cost of prediction:
  ▶ linear scan in Θ(nd)
  ▶ sublinear methods become (close to) linear in high dimensions
  (cf. linear methods: the algorithmic cost is put on pre-training, and there is no hyper-parameter)

• slow convergence in high dimensions (curse of dimensionality):
  ▶ asymptotic regime often not attained in practice ⇝ need to select k


Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels

Linear methods for classification


Response variable Y is discrete
▶ consider the fibers of the predictor f : f −1 ({1}), · · · , f −1 ({κ})
▶ a linear classifier produces linear decision boundaries
(cf. the lecture on linear regression methods for what is meant by a 'linear' method)

(figure: colored dots are the input observations with responses; the colored areas are the fibers f⁻¹({1}), f⁻¹({2}) of the predictor — left: linear decision boundaries, right: nonlinear)
Linear methods for classification
Response variable Y is discrete
▶ consider the fibers of the predictor f : f −1 ({1}), · · · , f −1 ({κ})
▶ a linear classifier produces linear decision boundaries

2 types of approaches:
▶ model the posterior probability (discriminant function δ_y) for each class y,
  then classify by taking argmax_{y∈{1,···,κ}} δ_y(x)
  (referring to the Bayes classifier, one tries to model the posterior probability P(Y = y | X = x))
  e.g. linear / logistic regression, LDA

(figure: result of LDA on the Iris dataset — left: the dataset with the decision boundaries and fibers f⁻¹({1}), f⁻¹({2}), f⁻¹({3}); right: the discriminant functions δ₁, δ₂, δ₃)

Linear methods for classification


Response variable Y is discrete
▶ consider the fibers of the predictor f : f −1 ({1}), · · · , f −1 ({κ})
▶ a linear classifier produces linear decision boundaries

2 types of approaches:
▶ model the posterior probability (discriminant function δ_y) for each class y,
  then classify by taking argmax_{y∈{1,···,κ}} δ_y(x)
  e.g. linear / logistic regression, LDA

▶ model the separating hyperplane(s) directly


e.g. SVM, perceptron, decision tree
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels

Linear regression for classification


Use a linear model for the discriminant functions (yields linear decision boundaries):

▶ ∀y ∈ {1, · · · , κ}, δy (x) := [ 1 xT ] βy for some parameter vector βy ∈ Rd+1

▶ matrix of parameters: B := [ β1 ··· βκ ] ∈ Rd+1×κ

Fit the model by least squares:   B̂ := argmin_B Σᵢ₌₁ⁿ ∥ Z(yᵢ) − [ 1 xᵢᵀ ] B ∥₂²

where Z(yᵢ) = [ 1_{yᵢ=1} ··· 1_{yᵢ=κ} ] is the row indicator vector of response yᵢ
(this vector has exactly one nonzero entry: the one at index yᵢ; the motivation is that we want the fitted values to approximate the class indicators)
Linear regression for classification
Use a linear model for the discriminant functions (yields linear decision boundaries):

▶ ∀y ∈ {1, · · · , κ}, δy (x) := [ 1 xT ] βy for some parameter vector βy ∈ Rd+1

▶ matrix of parameters: B := [ β1 ··· βκ ] ∈ Rd+1×κ

Fit the model by least squares: B̂ := argmin ∥Z − X B∥2F


B
(Frobenius norm: ∥M∥_F² := Σᵢ,ⱼ Mᵢⱼ² — this norm basically acts as the ℓ₂-norm in the matrix setting)

where Z(yᵢ) = [ 1_{yᵢ=1} ··· 1_{yᵢ=κ} ] is the row indicator vector of response yᵢ

▶ indicator response matrix (our new input of responses):  Z := [ Z(y₁) ; ··· ; Z(yₙ) ] ∈ ℝ^{n×κ}
  (a binary matrix with exactly one nonzero entry per row)

▶ random row indicator vector (the corresponding random variable in our statistical model):
  Z(Y) := [ 1_{Y=1} ··· 1_{Y=κ} ] ∈ ℝ^{1×κ}   — it is fully dependent on Y

Linear regression for classification


Use a linear model for the discriminant functions (yields linear decision boundaries):

▶ ∀y ∈ {1, · · · , κ}, δy (x) := [ 1 xT ] βy for some parameter vector βy ∈ Rd+1

▶ matrix of parameters: B := [ β1 ··· βκ ] ∈ Rd+1×κ

Fit the model by least squares: B̂ := argmin ∥Z − X B∥2F


B
Unique minimum (assuming full column rank in X):   B̂ = (Xᵀ X)⁻¹ Xᵀ Z
(otherwise we can regularize by some penalty — e.g. ℓ₂ or ℓ₁ — as in ordinary linear regression, to get a unique minimum)

▶ row vector of estimated discriminant functions:
  δ̂(x) := [ δ̂₁(x) ··· δ̂_κ(x) ] = [ 1 xᵀ ] B̂
(the y-th column in B̂ corresponds to the fitted parameter vector β̂_y, hence the y-th entry in δ̂(x) is δ̂_y(x) = [ 1 xᵀ ] β̂_y)

▶ classifier:   f̂(x) := argmax_{y∈{1,···,κ}} δ̂_y(x)
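A small C++/Eigen sketch of this fit and of the resulting classifier; building the indicator matrix from integer labels in {0, …, κ−1} is a convention chosen for this example only.

#include <Eigen/Dense>
#include <vector>

// Linear regression on class indicators: B = (X^T X)^{-1} X^T Z.
// X: n x (d+1) design matrix (with leading ones), labels in {0,...,kappa-1}.
Eigen::MatrixXd fit_indicator_regression(const Eigen::MatrixXd& X,
                                         const std::vector<int>& labels, int kappa) {
    const int n = X.rows();
    Eigen::MatrixXd Z = Eigen::MatrixXd::Zero(n, kappa);   // indicator responses
    for (int i = 0; i < n; ++i) Z(i, labels[i]) = 1.0;
    return (X.transpose() * X).ldlt().solve(X.transpose() * Z);
}

// Classify a query point (augmented with a leading 1) by the argmax rule.
int classify(const Eigen::MatrixXd& B, const Eigen::VectorXd& x_aug) {
    Eigen::VectorXd scores = B.transpose() * x_aug;   // delta_hat(x) as a column
    int best = 0;
    scores.maxCoeff(&best);                            // argmax over classes
    return best;
}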
Linear regression for classification
Experimental results:  n = 100 + 100 (mixture of 2 Gaussians), d = 2
(the result looks reasonable, as the two classes are not linearly separable)

(figure: linear regression — error rate ≈ 34%;   Bayes classifier — error rate ≈ 21%)

Linear regression for classification


Experimental results:

n = 100 + 100 + 100, d = 2, Bayes error rate ≈ 2.5%

(figure, left: decision regions of linear regression, error rate ≈ 33% — surprising, as the classes are indeed linearly separable; for comparison, the Bayes error rate is ≈ 2.5%)
(figure, right: diagonal cross-section showing δ̂₁, δ̂₂, δ̂₃ — this plot explains the phenomenon: the discriminant function of class 2 never dominates)
Linear regression for classification
What is happening:

▶ the δ̂y (x) supposedly model the posterior probabilities P(Y = y | X = x)


(cf. Bayes classifier)

▶ they sum up to 1 (in the centered model — assuming that the set of observations is centered in ℝᵈ, i.e. has mean 0)

▶ they may fall outside [0, 1]
(figure: δ̂₁, δ̂₂, δ̂₃ along the diagonal cross-section)
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels

Logistic regression for binary classification


Generalized linear model for discriminant functions:

▶ δ1 (x) := σ ([ 1 xT ] β1 ) for some parameter β1 ∈ Rd+1

exp(t) 1
where σ(t) := = (logistic sigmoid function)
1 + exp(t) 1 + exp(−t)
forces δ1 (x) ∈ [0, 1]
▶ δ2 (x) := 1 − δ1 (x)

forces δ1 (x) + δ2 (x) = 1, hence δ2 (x) ∈ [0, 1]


Logistic regression for binary classification
Generalized linear model for discriminant functions:

▶ δ1 (x) := σ ([ 1 xT ] β1 ) for some parameter β1 ∈ Rd+1

exp(t) 1
where σ(t) := = (logistic sigmoid function)
1 + exp(t) 1 + exp(−t)

▶ δ2 (x) := 1 − δ1 (x) = σ (− [ 1 xT ] β1 ) i.e. makes the


symmetric problem
problem in 1,invariant
2 under

Logistic regression for binary classification


Generalized linear model for discriminant functions:

▶ δ1 (x) := σ ([ 1 xT ] β1 ) for some parameter β1 ∈ Rd+1

exp(t) 1
where σ(t) := = (logistic sigmoid function)
1 + exp(t) 1 + exp(−t)

▶ δ2 (x) := 1 − δ1 (x) = σ (− [ 1 xT ] β1 )

Properties of the logistic sigmoid:
('sigmoid' means 'S-shaped' (and bounded) → the logistic sigmoid is but one example)

▶ monotonic homeomorphism R → (0, 1)

▶ σ −1 (u) = ln u
1−u
(logit function)

▶ σ(t) + σ(−t) = 1

▶ σ ′ (t) = σ(t) (1 − σ(t))


Logistic regression for binary classification
Generalized linear model for discriminant functions:

▶ δ1 (x) := σ ([ 1 xT ] β1 ) for some parameter β1 ∈ Rd+1

exp(t) 1
where σ(t) := = (logistic sigmoid function)
1 + exp(t) 1 + exp(−t)

▶ δ2 (x) := 1 − δ1 (x) = σ (− [ 1 xT ] β1 )

Properties of the regression and associated classifier:
(note that, once again, we define f̂(x) to be the argmax of the discriminant functions)

▶ the model for δ_y (y = 1, 2) assumes that P(Y = y | X = x) follows a logistic distribution

P(Y = 1 | X = x) = δ1 (x) = σ ([ 1 xT ] β1 )

P(Y = 2 | X = x) = 1 − δ1 (x) = δ2 (x) = σ (− [ 1 xT ] β1 )

Logistic regression for binary classification


Generalized linear model for discriminant functions:

▶ δ1 (x) := σ ([ 1 xT ] β1 ) for some parameter β1 ∈ Rd+1

exp(t) 1
where σ(t) := = (logistic sigmoid function)
1 + exp(t) 1 + exp(−t)

▶ δ2 (x) := 1 − δ1 (x) = σ (− [ 1 xT ] β1 )

Properties of the regression and associated classifier:
(note that, once again, we define f̂(x) to be the argmax of the discriminant functions)

▶ the model for δ_y (y = 1, 2) assumes that P(Y = y | X = x) follows a logistic distribution

▶ the model makes the probability ratio log-linear:
(this means that, fundamentally, we fit one of the discriminant functions (say δ₂) independently, then derive the other from it)

  ln [ P(Y = 1 | X = x) / P(Y = 2 | X = x) ] = ln [ σ([ 1 xᵀ ] β₁) / σ(−[ 1 xᵀ ] β₁) ]
    = ln [ (1 + exp([ 1 xᵀ ] β₁)) / (1 + exp(−[ 1 xᵀ ] β₁)) ] = [ 1 xᵀ ] β₁
(factorize the numerator by exp([ 1 xᵀ ] β₁))
Logistic regression for binary classification
Generalized linear model for discriminant functions:

▶ δ1 (x) := σ ([ 1 xT ] β1 ) for some parameter β1 ∈ Rd+1

exp(t) 1
where σ(t) := = (logistic sigmoid function)
1 + exp(t) 1 + exp(−t)

▶ δ2 (x) := 1 − δ1 (x) = σ (− [ 1 xT ] β1 )

Properties of the regression and associated classifier:
(note that, once again, we define f̂(x) to be the argmax of the discriminant functions)

▶ the model for δ_y (y = 1, 2) assumes that P(Y = y | X = x) follows a logistic distribution

▶ the model makes the probability ratio log-linear
(this means that, fundamentally, we fit one of the discriminant functions (say δ₂) independently, then derive the other from it)

▶ yields linear decision boundary: δ1 (x) = δ2 (x) ⇐⇒ δ1 (x) = 1/2


  ⇐⇒ [ 1 xᵀ ] β₁ = σ⁻¹(1/2),   i.e.   [ 1 xᵀ ] β₁ = 0

Logistic regression for binary classification


Model fitting by maximum likelihood:
(we have already seen maximum likelihood estimation for parametrized models in the lecture on density estimation)

∀i, L(yi ; xi , β1 ) := P(Y = yi | X = xi ; β1 )

⇒ L((yᵢ)ᵢ₌₁ⁿ ; (xᵢ)ᵢ₌₁ⁿ, β₁) = Πᵢ₌₁ⁿ P(Y = yᵢ | X = xᵢ ; β₁)    (independence)

⇒ log L((yᵢ)ᵢ₌₁ⁿ ; (xᵢ)ᵢ₌₁ⁿ, β₁) = Σᵢ₌₁ⁿ log P(Y = yᵢ | X = xᵢ ; β₁)

Change of variable:  Z := 1_{Y=1} ∈ {0, 1},   ∀i, zᵢ := 1_{yᵢ=1} ∈ {0, 1}
(this is just to have a response taking values in {0, 1} instead of {1, 2}, which is more convenient)

⇒ ∀i,  log P(Z = zᵢ | X = xᵢ ; β₁)
  = zᵢ log P(Z = 1 | X = xᵢ ; β₁) + (1 − zᵢ) log P(Z = 0 | X = xᵢ ; β₁)
  = zᵢ log σ([ 1 xᵢᵀ ] β₁) + (1 − zᵢ) log σ(−[ 1 xᵢᵀ ] β₁)
  = zᵢ [ 1 xᵢᵀ ] β₁ − log(1 + exp([ 1 xᵢᵀ ] β₁))
(the last line is deduced from the line above after a few lines of calculation, in which the expressions of σ(t) and σ(−t) are developed, where t := [ 1 xᵢᵀ ] β₁ — see below)
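The calculation alluded to above can be written out in two lines, using log σ(t) = t − log(1 + eᵗ) and log σ(−t) = − log(1 + eᵗ):

  zᵢ log σ(t) + (1 − zᵢ) log σ(−t) = zᵢ ( t − log(1 + eᵗ) ) − (1 − zᵢ) log(1 + eᵗ)
                                   = zᵢ t − log(1 + eᵗ),   with t = [ 1 xᵢᵀ ] β₁.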
Logistic regression for binary classification
Modelseen
have already fitting by maximum
maximum likelihood:
likelihood estimation for parametrized models in the lecture on densit

estimatorβ̂for is the one


β1argmax
1 := ℓ(βthat maximizes
1 ) where thelog
ℓ(β1 ) := L ((yi )n
log-likelihood, n
renamed
i=1 ; (x ) 1 ) for simplicity
i )i=1 , β1ℓ(β
β1
Xn
that this functional has the same shape as=the one
zi based
[ 1 xTi ]on − log (1 + exp
β1minimizing the empirical
([ 1 xTi ] β1risk
)) using
i=1

n
X  

{
1
∇ ℓ(βat
gradient vector 1) β
=1 (zi − σ ([ 1 xTi ] β1 )) xi
i=1

Xn
2  
∇ ℓ(β
Hessian matrix at1 )β1=. −
Here we
σ ([use the
1 xT
i ] βfact − σσ([′ 1=xTiσ(1
1 ) (1that ] β− σ) x1i [ 1 xTi ]
1 ))
i=1

negative
because the Hessian matrixsemi-definite ⇒ sum
is the negated a concave
ℓ(β1 )ofispositive function matrices with non-negative
semi-definite

▶ choose
the Hessian matrix is negative
β̂1 arbitrarily in semi-definite, only
the solution set ∇ ℓ(β
of the global
1 ) =maxima
0 annihilate the gradient.

d + 1 non-linear equations in β1

Logistic regression for binary classification


Model fitting by maximum likelihood:

β̂₁ := argmax_{β₁} ℓ(β₁)   where ℓ(β₁) := log L((yᵢ)ᵢ₌₁ⁿ ; (xᵢ)ᵢ₌₁ⁿ, β₁)
        = Σᵢ₌₁ⁿ zᵢ [ 1 xᵢᵀ ] β₁ − log(1 + exp([ 1 xᵢᵀ ] β₁))

Newton-Raphson's method:
(basically gradient descent — or rather ascent — with the step prescribed by the Hessian matrix)

init:  β̂₁ ← 0    // or some arbitrary vector (the null vector is a good first guess in general)
repeat:
  β̂₁ ← β̂₁ − ( ∇²ℓ(β̂₁) )⁻¹ ∇ℓ(β̂₁)    // assuming non-singular Hessian
until convergence    // requires convergence threshold

Thm: if ℓ has a maximum β̄₁ s.t. ∇²ℓ(β̄₁) is non-singular, then β̄₁ is the unique max and, for an initial β̂₁ close enough to β̄₁, the convergence to β̄₁ is quadratic.
(a non-singular Hessian at a maximum implies a negative definite Hessian, hence ℓ is strictly concave close enough to β̄₁; the Hessian at β̂₁ is then non-singular as well, hence the algorithm proceeds)
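For concreteness, here is a minimal C++/Eigen sketch of the Newton-Raphson iteration above (equivalently, iteratively reweighted least squares); the stopping rule and parameter names are choices made for this example.

#include <Eigen/Dense>
#include <cmath>

// Binary logistic regression fitted by Newton-Raphson.
// X: n x (d+1) design matrix with leading ones, z: responses in {0,1}.
Eigen::VectorXd fit_logistic(const Eigen::MatrixXd& X, const Eigen::VectorXd& z,
                             int max_iter = 50, double tol = 1e-8) {
    Eigen::VectorXd beta = Eigen::VectorXd::Zero(X.cols());   // initial guess
    for (int it = 0; it < max_iter; ++it) {
        Eigen::VectorXd t = X * beta;
        Eigen::VectorXd s = t.unaryExpr([](double v) { return 1.0 / (1.0 + std::exp(-v)); });
        Eigen::VectorXd grad = X.transpose() * (z - s);                 // gradient of l
        Eigen::VectorXd w = (s.array() * (1.0 - s.array())).matrix();   // sigma(1 - sigma)
        Eigen::MatrixXd XtWX = X.transpose() * w.asDiagonal() * X;      // = -Hessian of l
        Eigen::VectorXd step = XtWX.ldlt().solve(grad);                 // Newton step
        beta += step;   // beta - H^{-1} grad, with H = -XtWX
        if (step.norm() < tol) break;                                   // convergence test
    }
    return beta;
}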
Logistic regression for binary classification
Degenerate cases (singular Hessian):
(in fact, as the running example is degenerate, applying the vanilla logistic regression may lead to numerical issues)

▶ regularized logistic regression:

  β̂₁ := argmax_{β₁} Σᵢ₌₁ⁿ [ zᵢ [ 1 xᵢᵀ ] β₁ − log(1 + exp([ 1 xᵢᵀ ] β₁)) ] − λ ∥β₁∥ₚᵖ
(note that here we have a minus sign in front of the penalty because we are maximizing a quantity, as opposed to linear regression)

▶ case p = 2 (Tikhonov):
(in the lecture on regression we called it 'ridge' because it led to ridge (linear) regression)

  ∇ℓ(β₁) = Σᵢ₌₁ⁿ ( zᵢ − σ([ 1 xᵢᵀ ] β₁) ) [ 1 xᵢᵀ ]ᵀ − 2λ β₁

  ∇²ℓ(β₁) = − Σᵢ₌₁ⁿ σ([ 1 xᵢᵀ ] β₁) (1 − σ([ 1 xᵢᵀ ] β₁)) [ 1 xᵢᵀ ]ᵀ [ 1 xᵢᵀ ] − 2λ I_{d+1}

  negative definite ⇒ strictly concave functional

Logistic regression for binary classification


Degenerate cases (singular Hessian):

▶ regularized logistic regression:
  β̂₁ := argmax_{β₁} Σᵢ₌₁ⁿ [ zᵢ [ 1 xᵢᵀ ] β₁ − log(1 + exp([ 1 xᵢᵀ ] β₁)) ] − λ ∥β₁∥ₚᵖ

▶ case p = 2 (Tikhonov): regularized gradient and Hessian as on the previous slide;
  the Hessian is negative definite ⇒ strictly concave functional

▶ apply Newton-Raphson, with guaranteed quadratic convergence


(small values λ may lead to numerical instabilities in practice though)
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels

Multi-class logistic regression


Log-linear model for posterior probability ratios:
(as in the binary case, we choose a reference class, say y = κ, and regress the other classes against it)

  ln [ P(Y = 1 | X = x) / P(Y = κ | X = x) ] = [ 1 xᵀ ] β₁
  ···
  ln [ P(Y = κ−1 | X = x) / P(Y = κ | X = x) ] = [ 1 xᵀ ] β_{κ−1}
  parameter matrix B := [ β₁ ··· β_{κ−1} ] ∈ ℝ^{(d+1)×(κ−1)}

▶ generalized linear model for discriminant functions:

  δ_y(x) := P(Y = y | X = x) = exp([ 1 xᵀ ] β_y) / ( 1 + Σ_{z<κ} exp([ 1 xᵀ ] β_z) )    for y = 1, ···, κ−1

  δ_κ(x) := P(Y = κ | X = x) = 1 / ( 1 + Σ_{z<κ} exp([ 1 xᵀ ] β_z) ) = 1 − Σ_{y<κ} δ_y(x)
(generalized sigmoid — 'softmax' is the terminology; each δ_y(x) ∈ [0, 1], and this forces (again) the discriminant functions to sum up to 1)
Multi-class logistic regression
Log-linear model for posterior probability ratios (as on the previous slide), with parameter matrix B := [ β₁ ··· β_{κ−1} ] ∈ ℝ^{(d+1)×(κ−1)}

▶ generalized linear model for discriminant functions (softmax):
  δ_y(x) = exp([ 1 xᵀ ] β_y) / ( 1 + Σ_{z<κ} exp([ 1 xᵀ ] β_z) )  for y < κ,    δ_κ(x) = 1 / ( 1 + Σ_{z<κ} exp([ 1 xᵀ ] β_z) )

▶ estimate B by maximum likelihood and Newton-Raphson's algorithm
(the expressions for the objective function and for the iteration steps are more complicated than in the binary case)
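A small C++/Eigen sketch of the softmax discriminant functions above, given a fitted parameter matrix; the layout of B and the augmented query vector are conventions chosen for this example.

#include <Eigen/Dense>
#include <cmath>

// Softmax discriminants: B is (d+1) x (kappa-1), class kappa is the reference;
// x_aug = [1; x]. Returns (delta_1(x), ..., delta_kappa(x)), which sum to 1.
Eigen::VectorXd discriminants(const Eigen::MatrixXd& B, const Eigen::VectorXd& x_aug) {
    const int kappa = B.cols() + 1;
    Eigen::VectorXd delta(kappa);
    double denom = 1.0;                          // 1 + sum_{z<kappa} exp([1 x^T] beta_z)
    for (int y = 0; y < kappa - 1; ++y) {
        delta(y) = std::exp(x_aug.dot(B.col(y)));
        denom += delta(y);
    }
    delta(kappa - 1) = 1.0;                      // numerator of the reference class
    // In practice one would subtract the maximum exponent first for stability.
    return delta / denom;
}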

Multi-class logistic regression


Back to our running experiment:

n = 100 + 100 + 100, d = 2, Bayes error rate ≈ 2.5%

linear (error rate ≈ 33%) logistic (error rate ≈ 3%)


Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels

Support Vector Machines (SVM)


Principle: explicitly construct the ‘best’ hyperplanes separating the various classes.

▶ the hyperplanes that maximize the margins (closest distances to data points)

Hyperplane equation: xT β − β0 = 0
▶ parameters: β ∈ ℝᵈ \ {0}, β₀ ∈ ℝ
▶ β is normal to the hyperplane
▶ β₀ / ∥β∥ is the shift from the origin along β
▶ fix 1/∥β∥ to be the margin
(there is one degree of freedom in the hyperplane equation, since the solution set is invariant under rescaling of (β, β₀))

⇒ maximizing the margin is equivalent to minimizing ∥β∥ or ∥β∥²
⇒ the slab boundaries have equations xᵀβ − β₀ = ±1
(this is another consequence of our convention that 1/∥β∥ is set to be the margin)

(figure: hyperplane xᵀβ − β₀ = 0 with normal direction β/∥β∥, offset β₀/∥β∥ from the origin, slab boundaries xᵀβ − β₀ = ±1, margin 1/∥β∥)
Support Vector Machines (SVM)
Principle: explicitly construct the ‘best’ hyperplanes separating the various classes.

▶ the hyperplanes that maximize the margins (closest distances to data points)

Binary classification case (Y = {−1, 1}):

(figure: the maximum-margin slab separating the two classes (xᵢ, +1) and (xᵢ, −1))

  β̂, β̂₀ := argmin_{β,β₀} ∥β∥²   subject to:        (maximize margin)
    xᵢᵀβ − β₀ ≥ 1     ∀i s.t. yᵢ = 1
    xᵢᵀβ − β₀ ≤ −1    ∀i s.t. yᵢ = −1
(⇔ yᵢ (xᵢᵀβ − β₀) ≥ 1  ∀i = 1, ···, n — this constraint is equivalent to the previous ones)
(leave data points outside slab, on correct side)

▶ quadratic programming problem (w/ pos. definite quadratic form) ⇝ solvers: ellipsoid, interior point, etc.
(QP problems have quadratic objective functions and linear constraints — equalities or inequalities)

Support Vector Machines (SVM)


Principle: explicitly construct the ‘best’ hyperplanes separating the various classes.

▶ the hyperplanes that maximize the margins (closest distances to data points)

Binary classification case (Y = {−1, 1}):


(figure: fitted slab with boundaries xᵀβ̂ − β̂₀ = ±1 and margin 1/∥β̂∥)

  β̂, β̂₀ := argmin_{β,β₀} ∥β∥²   subject to   yᵢ (xᵢᵀβ − β₀) ≥ 1   ∀i = 1, ···, n
(maximize margin; leave data points outside slab, on correct side)

▶ quadratic programming problem      ▶ classifier:   f̂(x) = sign( xᵀβ̂ − β̂₀ )
(the class/label of a query point x is determined by the side of the fitted hyperplane on which it falls)
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels

When classes are not linearly separable


(In such cases, the previous problem with hard constraints has no solution ⇒ we must relax it.)

Hinge loss:   max{ 0, 1 − yᵢ (xᵢᵀβ − β₀) }
(the loss is zero for observations lying on the correct side of the slab, i.e. satisfying the previous constraints)
(loss compared to a slab excluding observation xᵢ)

Relaxed optimization (soft margin):

  β̂, β̂₀ := argmin_{β,β₀} (1/n) Σᵢ₌₁ⁿ max{ 0, 1 − yᵢ (xᵢᵀβ − β₀) } + λ ∥β∥²
                (minimize mean loss)                                      (∝ maximize margin)
(λ is a mixing / trade-off parameter; minimizing the mean loss means trying to satisfy the constraints as best as possible — this loss term competes with the margin term)

▶ when classes are linearly separable, recover the problem with hard constraints by taking λ > 0 small enough
(indeed, for λ > 0 small enough, the second term in the functional becomes negligible)
When classes are not linearly separable
(In such cases, the previous problem with hard constraints has no solution ⇒ we must relax it.)

Hinge loss:   max{ 0, 1 − yᵢ (xᵢᵀβ − β₀) }    (loss compared to a slab excluding observation xᵢ)

Conversion to quadratic program:

▶ slack variables:   ξᵢ := max{ 0, 1 − yᵢ (xᵢᵀβ − β₀) }    (ξᵢ measures the loss on the i-th constraint)

▶ substitution:
  β̂, β̂₀, (ξ̂ᵢ)ᵢ₌₁ⁿ := argmin_{β,β₀,(ξᵢ)} (1/n) Σᵢ₌₁ⁿ ξᵢ + λ ∥β∥²    (∝ loss + ∝ margin)
  subject to:   ∀i,  ξᵢ ≥ 0  and  yᵢ (xᵢᵀβ − β₀) ≥ 1 − ξᵢ
(this is a quadratic program in the unknowns β, β₀, (ξᵢ)ᵢ₌₁ⁿ, with a positive-definite quadratic term; we infringe the definition of ξᵢ to remove the max from the constraints — specifically, we only ask ξᵢ ≥ 0 and ξᵢ ≥ 1 − yᵢ(xᵢᵀβ − β₀))
(at the optimum one of the two inequalities turns into an equality, so ξ̂ᵢ = max{ 0, 1 − yᵢ (xᵢᵀβ̂ − β̂₀) }, because each ξᵢ can be optimized independently from the others once β, β₀ have been fixed)

When classes are not linearly separable


(Hinge loss and relaxed quadratic program as on the previous slide; a few examples are shown on the picture.)

Interpretation:
▶ ξ̂ᵢ > 0:  xᵢ is on the wrong side of its slab boundary
  (indeed, in that case 1 − yᵢ (xᵢᵀβ̂ − β̂₀) = ξ̂ᵢ > 0)
▶ ξ̂ᵢ = 0:
  • yᵢ (xᵢᵀβ̂ − β̂₀) = 1:  xᵢ is on the slab boundary
  • yᵢ (xᵢᵀβ̂ − β̂₀) > 1:  xᵢ is on the correct side
(support vectors: the vectors that count to define the slab)

  β̂, β̂₀, (ξ̂ᵢ)ᵢ₌₁ⁿ := argmin_{β,β₀,(ξᵢ)} (1/n) Σᵢ₌₁ⁿ ξᵢ + λ ∥β∥²
  subject to:   ∀i,  ξᵢ ≥ 0  and  yᵢ (xᵢᵀβ − β₀) ≥ 1 − ξᵢ
When classes are not linearly separable
(Hinge loss, interpretation and relaxed quadratic program as on the previous slide.)

  β̂, β̂₀, (ξ̂ᵢ)ᵢ₌₁ⁿ := argmin_{β,β₀,(ξᵢ)} (1/n) Σᵢ₌₁ⁿ ξᵢ + λ ∥β∥²
  subject to:   ∀i,  ξᵢ ≥ 0  and  yᵢ (xᵢᵀβ − β₀) ≥ 1 − ξᵢ
(the trade-off parameter λ is selected e.g. by cross-validation)
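In practice this optimization is handed to a quadratic-programming solver. As an alternative illustration of the same soft-margin objective, here is a minimal C++ sketch of stochastic subgradient descent on the hinge loss plus λ∥β∥² (a Pegasos-style method, not the QP formulation of the slides; names and the step-size schedule are assumptions).

#include <Eigen/Dense>
#include <vector>

// Stochastic subgradient descent on
// (1/n) sum_i max{0, 1 - y_i (x_i^T beta - beta0)} + lambda ||beta||^2, lambda > 0.
void fit_linear_svm(const Eigen::MatrixXd& X,          // n x d observations
                    const std::vector<int>& y,         // labels in {-1, +1}
                    double lambda, int epochs,
                    Eigen::VectorXd& beta, double& beta0) {
    const int n = X.rows();
    beta = Eigen::VectorXd::Zero(X.cols());
    beta0 = 0.0;
    long t = 1;
    for (int e = 0; e < epochs; ++e) {
        for (int i = 0; i < n; ++i, ++t) {
            double eta = 1.0 / (lambda * t);            // decreasing step size
            double margin = y[i] * ((X.row(i) * beta).value() - beta0);
            beta *= (1.0 - 2.0 * eta * lambda);          // subgradient of the regularizer
            if (margin < 1.0) {                          // hinge term is active
                beta  += eta * y[i] * X.row(i).transpose();
                beta0 -= eta * y[i];
            }
        }
    }
}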
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels

Multi-class SVM
(This is because SVM is essentially tied to binary classification.)
Principle: convert the multi-class problem into multiple binary problems.

▶ One-vs-all:

  • train 1 classifier (β̂ʸ, β̂₀ʸ) for each class y = 1, ···, κ, to discriminate class y (assigned label 1) from the rest of the data (assigned label −1)
    (in this binary problem, class y is seen as the positive class)

  • assign each new observation x ∈ ℝᵈ to the class  argmax_{y=1,···,κ}  xᵀβ̂ʸ − β̂₀ʸ
    (the class whose corresponding classifier gives the highest score wins the bet for x)

▶ One-vs-one:
  • train 1 classifier (β̂^{y,y′}, β̂₀^{y,y′}) for each pair of classes y ≠ y′ ∈ {1, ···, κ}, to discriminate y from y′ in their joint subpopulation
    (thus there are (κ choose 2) binary classifiers, i.e. a quadratic quantity in the number of classes; here, discrimination is among the subpopulation spanned by the observations with labels y or y′)

• given a new observation x ∈ Rd , decide between y and y ′ using


 
T y,y ′ y,y ′
sign x β̂ − β̂0 for each pair of classes y ̸= y ′ , then assign x
to the class with the highest number of positive answers.
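For the one-vs-all scheme, the prediction rule is a simple argmax over the per-class scores; a minimal C++ sketch (container layout chosen for this example):

#include <Eigen/Dense>
#include <vector>

// One-vs-all prediction: each class y has its own trained pair (beta_y, beta0_y);
// the query x is assigned to the class with the highest score x^T beta_y - beta0_y.
int predict_one_vs_all(const std::vector<Eigen::VectorXd>& betas,
                       const std::vector<double>& beta0s,
                       const Eigen::VectorXd& x) {
    int best = 0;
    double best_score = x.dot(betas[0]) - beta0s[0];
    for (std::size_t y = 1; y < betas.size(); ++y) {
        double score = x.dot(betas[y]) - beta0s[y];
        if (score > best_score) { best_score = score; best = static_cast<int>(y); }
    }
    return best;    // index of the winning class
}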
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels

Kernel SVM

(figure: a two-class dataset that is not linearly separable in the data space — the linear SVM classifier in the data space performs poorly)

feature map Φ : x ↦ [ x ∥x∥² ]

(figure: in the feature space the two classes become linearly separable (hence the support vectors); the SVM decision boundary found in the feature space pulls back to a non-linear decision boundary in the data space)
Kernel SVM
Quadratic program (hard margin / no slack):
(we present the hard-margin case; the soft-margin case is similar)

  argmin_{β,β₀} ∥β∥²   subject to   yᵢ ( Φ(xᵢ)ᵀβ − β₀ ) ≥ 1   ∀i = 1, ···, n
(β is a vector in the feature space, which may be infinite-dimensional; Φ(xᵢ)ᵀβ is an inner product in the feature space)

(figure: same feature-space embedding as on the previous slide)
Kernel SVM
Quadratic program (hard margin / no slack):

  argmin_{α,β₀} Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ αᵢ yᵢ k(xᵢ, xⱼ) yⱼ αⱼ    subj. to    yᵢ ( Σⱼ₌₁ⁿ αⱼ yⱼ k(xᵢ, xⱼ) − β₀ ) ≥ 1   ∀i
(we merely substitute for β the linear combination of the Φ(xᵢ)'s; we thus get a new quadratic program)

Representer Thm  ⇒  β̂ = Σᵢ₌₁ⁿ αᵢ yᵢ Φ(xᵢ) = Σᵢ₌₁ⁿ αᵢ yᵢ k(xᵢ, ·)

Experimental results
n = 100 + 100 (mixture of 2 Gaussians), d = 2

(figure: Bayes classifier decision boundary — error rate ≈ 21%)
(the Bayes classifier and its error rate can be estimated accurately since we know the underlying distribution)

Experimental results
n = 100 + 100 (mixture of 2 Gaussians), d = 2

(figure: linear regression — error rate ≈ 34%;   7-NN classifier — error rate ≈ 22.5%)
(the parameter k = 7 has been selected using cross-validation)
Experimental results
n = 100 + 100 (mixture of 2 Gaussians), d = 2

(figure: SVM with λ = 10⁻² — error rate ≈ 28.8%;   SVM with λ = 10⁴ — error rate ≈ 30%)
(small λ's privilege the constraints, hence a small margin and a small number of support vectors; large λ's privilege the margin term)

Experimental results
n = 100 + 100 (mixture of 2 Gaussians), d = 2

(figure: SVM with deg.-4 polynomial kernel — error rate ≈ 24.5%;   SVM with Gaussian kernel — error rate ≈ 21.8%)
(the dashed lines are the slab's boundaries, as before; the decision boundary of the Bayes classifier is shown for comparison. The performance of the Gaussian kernel is particularly good here, which is explained by the fact that the classes are mixtures of Gaussians)
Experimental results
n = 100 + 100 (mixture of 2 Gaussians), d = 2

(figure: logistic reg. with deg.-4 polynomial kernel — error rate ≈ 26.3%;   logistic reg. with Gaussian kernel — error rate ≈ 22.1%)
(logistic regression in feature space is regularized; λ and the window size for the kernel have been selected by cross-validation. Once again, the performance of the Gaussian kernel is particularly good, due to the fact that the classes are mixtures of Gaussians)
What you should know
• Two types of linear approaches for classification

• Logistic regression:
- generalized linear model
- fitting by likelihood maximization & Newton-Raphson’s method,
convergence guarantees
- degenerate cases & regularization
- extension to multi-class

• Support Vector Machines (SVM):


- margin maximization & quadratic programming problem
- hinge loss, soft margin,
relaxed quadratic programming problem, support vectors
- extension to multi-class
- kernel SVM
Artificial Neural Networks

Outline:

• Connectionist machine learning: principles & historical landmarks

• Rosenblatt’s perceptron algorithm

• Connectionist view on perceptron, multi-layer perceptron (MLP)

• Approximation power of MLP

• Training of MLP, gradient back-propagation in neural networks

• Regularization of neural networks

• Convolutional neural networks (CNN)


Outline

• Connectionist machine learning: principles & historical landmarks

• Rosenblatt’s perceptron algorithm

• Connectionist view on perceptron, multi-layer perceptron (MLP)

• Approximation power of MLP

• Training of MLP, gradient back-propagation in neural networks

• Regularization of neural networks

• Convolutional neural networks (CNN)

roaches: symbolic (expert systems), connectionist (neural nets), statistical learning


Connectionist approach to machine learning
eal of the connectionist approach resides in the fact that new learning algorithms can

▶ Modeled after the human brain


▶ small units (neurons) learn simple functions
▶ networks of these units can learn complicated functions

x1

x2
y
···

xd−1

xd
Historical landmarks

Birth of connectionist approach:

[McCulloch, Pitts] [Rosenblatt] [Minsky, Papert]

neuron model perceptron algorithm perceptron cannot


to implement to train single learn XOR
logical gates logical neurons lack of training algorithm
via linear functions fori.e. networks
multi-layer of perceptrons
perceptrons

1943 1957 1969

Historical landmarks

General sentiment from the work of Minsky and


First AI winter:
ert: neural networks do not work for compli-
functions.

developments of backpropagation and stochastic


gradient descent in other contexts

1969 1986
Historical landmarks

Rebirth and expansion:

[Hinton] [Hinton et al.]

Boltzmann machines backpropagation for


with stochastic units neural networks
autoencoder architecture
recurrent neural networks (RNNs)
backpropagation through time

1985 1986

Historical landmarks

Rebirth and expansion:

[Cybenko] [Le Cun et al.] [Baldi, Hornik] [Anderson]

Universal convolutional neural autoencoders for neural networks for


Approximation networks (CNNs) unsupervised learning: reinforcement learning:
Theorem PCA, k-means solving the double
efficient handwritten
pendulum problem
digits recognition

1989 1989
Historical landmarks

Rebirth and expansion:

[Siegelmann, [Lin] [Bengio et al.] [Schmidhuber,


Sontag] Hochreiter]
Universal neural networks backpropagation ineffective Long Short-Term
Turing machine for reinforcement on networks with ‘many’
here, a Memory
few unitsnetworks
already is already
using RNNs learning tasks hidden layers (e.g. RNNs) (LSTMs) suffering
these networks
in robotics due to vanishing less than other
gradient issue RNNs from
vanishing gradient

1991 1993 1994 1997

Historical landmarks

General sentiment from the work of Bengio et al.:


Second AI winter:
networks with ‘many’ layers overfit the data
quickly become impossible to train.
Meanwhile, statistical analysis techniques (e.g.
random forests, SVMs, kernels) developed rapidly,
getting state-of-the-art results on practical tasks,
rapidmathematical
eing backed up by solid development of promising statistical learning
foun-
dations. techniques: random forests, SVMs, kernels

the (now small) neural networks community takes


refuge at the Canadian Institute for Advanced
Research (CIFAR)

1998 2006
Historical landmarks

‘Deep learning
the expression conspiracy’:
that the protagonists themselves employ to designate their strategy to put neural

[Hinton et al.] [Bengio et al.] [Bengio et al.] [Hinton et al.]


[Hinton et al.]
[Le Cun et al.]
fast greedy unsupervised Usefulness of this technique
Dropout technique consists
pre-training for the approach rectified linear
rithmically
weightsefficient technique
initialization allowed backpropagation
confirmed to avoid the vanishing gradient effect
units (ReLUs)
in backpropagation experimentally ReLUs are neurons whose activation
‘Deep networks’ termed

2006 2007 2009-2011 2012

Historical landmarks

Deep
slightly go backlearning
in time,and
frombig data:
2012 to 2009, for the sake of theme consistency

[Ng et al.] [Hinton et al.] [Hinton et al.]

training neural decade-long records AlexNet wins the Turing Award


networks on GPU in speech recognition ImageNet challenge ··· attributed to
broken using deep reducing error rates Bengio,
learning from 25% to 16% Hinton,
Le Cun

2009 2011 2012 2019

many more developments


(deep learning tsunami)
Outline

• Connectionist machine learning: principles & historical landmarks

• Rosenblatt’s perceptron algorithm

• Connectionist view on perceptron, multi-layer perceptron (MLP)

• Approximation power of MLP

• Training of MLP, gradient back-propagation in neural networks

• Regularization of neural networks

• Convolutional neural networks (CNN)

Rosenblatt’s perceptron algorithm


▶ designed for binary classification (Y = {−1, 1})
h i
▶ models the separating hyperplane directly:   xᵀβ − β₀ = 0
  (it is like SVM, but the objective function is different)

  classifier:   y(x) := sign( [ 1 xᵀ ] [ −β₀ ; β ] ) = sign( xᵀβ − β₀ )
  (the associated classifier is the indicator function of the positive half-space bounded by the hyperplane)

(figure: separating hyperplane xᵀβ − β₀ = 0, at distance β₀/∥β∥ from the origin along the normal direction β)
Rosenblatt’s perceptron algorithm
▶ designed for binary classification (Y = {−1, 1})
h i
▶ models the separating hyperplane directly:   xᵀβ − β₀ = 0   (like SVM, but with a different objective function)

▶ fits the model by minimizing the distances of the misclassified points to the decision boundary:

  β̂, β̂₀ := argmin_{β,β₀}  Σ_{xᵢ misclassified at β,β₀}  − yᵢ (xᵢᵀβ − β₀)
(Note: the distances here are absolute, not signed, hence the formula: for a misclassified point xᵢ, the quantities yᵢ and xᵢᵀβ − β₀ have opposite signs. Each term equals ∥β∥ × the unsigned distance of xᵢ to the hyperplane; the choice of ∥β∥ is a free parameter, since the hyperplane remains the same if β and β₀ are both multiplied by the same constant.)

  classifier:   ŷ(x) := sign( xᵀβ̂ − β̂₀ )    (in practice the classifier uses the estimated parameters β̂, β̂₀)
0

Rosenblatt’s perceptron algorithm


▶ designed for binary classification (Y = {−1, 1})
h i
▶ models the separating hyperplane directly:   xᵀβ − β₀ = 0   (like SVM, but with a different objective function)

▶ fits the model by minimizing the distances of the misclassified points to the decision boundary:
  β̂, β̂₀ := argmin_{β,β₀}  Σ_{xᵢ misclassified at β,β₀}  − yᵢ (xᵢᵀβ − β₀)

▶ piecewise linear functional optimization:
(this is a special case of piecewise smooth optimization; the 'piecewise' comes from the fact that, when we stand at a particular position (β, β₀), the set of misclassified points is fixed. Differentiating within a piece:)

  ∂/∂β  = − Σ_{xᵢ misclassified at β,β₀}  yᵢ xᵢ
  ∂/∂β₀ =    Σ_{xᵢ misclassified at β,β₀}  yᵢ
▶ designed for binary classification (Y = {−1, 1})
h i
▶ models the separating hyperplane directly:   xᵀβ − β₀ = 0   (like SVM, but with a different objective function)

▶ fits the model by minimizing the distances of the misclassified points to the decision boundary:
  β̂, β̂₀ := argmin_{β,β₀}  Σ_{xᵢ misclassified at β,β₀}  − yᵢ (xᵢᵀβ − β₀)

▶ piecewise linear functional optimization:   (gradient descent)

  init:  set [ β̂₀ ; β̂ ] at random
  repeat:
    compute misclassified set M
    [ β̂₀ ; β̂ ] ← [ β̂₀ ; β̂ ] + ϱ Σ_{xᵢ∈M} yᵢ [ −1 ; xᵢ ]
    (ϱ ∈ [0, 1] is the step size, or learning rate, of the algorithm)
  until convergence   // requires convergence threshold
  (a threshold is not needed if a separating hyperplane actually exists — see below)

Rosenblatt’s perceptron algorithm


▶ designed for binary classification (Y = {−1, 1})
h i
▶ models the separating hyperplane directly:   xᵀβ − β₀ = 0   (like SVM, but with a different objective function)

▶ fits the model by minimizing the distances of the misclassified points to the decision boundary:
  β̂, β̂₀ := argmin_{β,β₀}  Σ_{xᵢ misclassified at β,β₀}  − yᵢ (xᵢᵀβ − β₀)

▶ piecewise linear functional optimization:   (stochastic gradient descent, also called 'iterative')
(the difference with the classical gradient descent is that a step is taken — and thus M is updated — after each misclassified observation)

  init:  set [ β̂₀ ; β̂ ] at random
  repeat:
    Foreach i = 1, ···, n do
      If yᵢ (xᵢᵀβ̂ − β̂₀) < 0 then  [ β̂₀ ; β̂ ] ← [ β̂₀ ; β̂ ] + ϱ yᵢ [ −1 ; xᵢ ]   // update β̂ and M
  until convergence   // requires convergence threshold (in principle)
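A minimal C++/Eigen sketch of this stochastic update loop; stopping after an epoch with no update corresponds to the separable case, and the parameter names are conventions of this example.

#include <Eigen/Dense>
#include <vector>

// Rosenblatt's perceptron, stochastic variant. Labels y_i in {-1, +1}.
void train_perceptron(const Eigen::MatrixXd& X, const std::vector<int>& y,
                      double rho, int max_epochs,
                      Eigen::VectorXd& beta, double& beta0) {
    beta = Eigen::VectorXd::Random(X.cols());   // random initialization
    beta0 = 0.0;
    for (int e = 0; e < max_epochs; ++e) {
        bool updated = false;
        for (int i = 0; i < X.rows(); ++i) {
            double score = (X.row(i) * beta).value() - beta0;
            if (y[i] * score < 0.0) {            // x_i is misclassified
                beta  += rho * y[i] * X.row(i).transpose();
                beta0 -= rho * y[i];             // gradient step on (beta0, beta)
                updated = true;
            }
        }
        if (!updated) break;   // no misclassification: a separating hyperplane was found
    }
}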


Rosenblatt’s perceptron algorithm
▶ designed for binary classification (Y = {−1, 1})
h i
▶ models the separating hyperplane directly:   xᵀβ − β₀ = 0   (like SVM, but with a different objective function)

▶ fits the model by minimizing the distances of the misclassified points to the decision boundary:
  β̂, β̂₀ := argmin_{β,β₀}  Σ_{xᵢ misclassified at β,β₀}  − yᵢ (xᵢᵀβ − β₀)

Thm: [Rosenblatt 1960] [Novikoff 1962]
(the perceptron itself was proposed by Rosenblatt in 1957)
  • If the two classes are linearly separable, then stochastic gradient descent with ϱ = 1 makes the energy converge to 0 in finitely many steps.
    (convergence of the energy to 0 means that some separating hyperplane is found)
  • More precisely, if ∃ a separating hyperplane with margin γ and if ∥xᵢ∥ ≤ R ∀i = 1, ···, n, then convergence occurs after O(R²/γ²) steps.
    (the smaller the optimal margin, the longer it takes for the algorithm to converge)

Rosenblatt’s perceptron algorithm


Advantages:

▶ linear structure of the energy permits the use of stochastic gradient descent
  (the energy is a sum of terms, each one depending on a single observation)
▶ stochastic gradient descent scales up well and allows for re-training
  (input observations and responses are processed in sequence — no need to restart the training)

Drawbacks:

▶ no unique solution ⇝ solution depends on initialization
  (this is in contrast to SVM, which provides separators maximizing the margin)
▶ small margins lead to long convergence times
▶ algorithm behaves badly on non-separable classes:
  - convergence to irrelevant configurations
    (in contrast to SVM, which provides separators with best margin even in this case)
  - cyclic behavior (with potentially long cycles)
(despite these drawbacks, the approach remains highly appealing thanks to the stochastic gradient descent)


Outline

• Connectionist machine learning: principles & historical landmarks

• Rosenblatt’s perceptron algorithm

• Connectionist view on perceptron, multi-layer perceptron (MLP)

• Approximation power of MLP

• Training of MLP, gradient back-propagation in neural networks

• Regularization of neural networks

• Convolutional neural networks (CNN)

Connectionist viewpoint on the perceptron


Perceptron neuron:
(figure: a perceptron neuron — inputs x₁, ···, x_d ∈ ℝ with weights β₁, ···, β_d ∈ ℝ, bias β₀ ∈ ℝ, linear activation xᵀβ − β₀, and a step activation function ℝ → ℝ (Heaviside))
(in the case of the perceptron, the activation function is the Heaviside step function, which is non-linear)

  y = sign( xᵀβ − β₀ )
(here x is the vector obtained by collating the inputs x₁, ···, x_d, and β the vector of weights)
Connectionist viewpoint on the perceptron

▶ In practice, smooth or piecewise smooth activation functions are used:

identity: t 7→ t ⇝ neuron implements linear regression

ReLU: t 7→ max{0, t} ⇝ simpler gradient expressions, faster training


helps address the vanishing gradient effect

Gaussian: t 7→ exp(−t2 ) ⇝ neuron implements linear regression with rbf

1
sigmoid: t 7→ ⇝ neuron implements logistic regression
1 + exp(−t)

exp(xi )
softmaxi : (x1 , · · · xd ) 7→ P ⇝ produces outputs in [0, 1]
j exp(xj )
outputs sum up to 1

Multi-layer perceptron (MLP)


('multi-layer perceptron' is a bit of a misnomer: the perceptron is initially a neuron, not a network)

Feedforward, full connectivity between consecutive layers:
(feedforward: the signal moves from left to right; arrow heads are omitted in the picture to avoid clutter)

(figure: input layer — its neurons are mere identity functions: all they do is forward their (unique) input; the xᵢ's are the coordinates of the input observation x ∈ ℝᵈ, not a collection of d observations — followed by s hidden layers with rᵢ neurons per layer i ('hidden' meaning not directly connected to the input or output variables), with activations σ^{i,j} and biases β₀^{i,j}, and an output layer producing y₁, ···, y_κ through output functions ϕ₁, ···, ϕ_κ)
feed forward: evaluate layer per layer in a forward pass


Multi-layer perceptron (MLP)
(same remark about the name 'multi-layer perceptron' as before)

Feedforward, full connectivity between consecutive layers: vanilla MLP for κ-class classification
(the softmax functions smax₁, ···, smax_κ in the output layer ensure that the outputs lie in [0, 1] and sum up to 1)

(figure: input layer, s hidden layers with rᵢ neurons per layer i, output layer of κ softmax units producing y₁, ···, y_κ)

feed forward: evaluate layer per layer in a forward pass

Multi-layer perceptron (MLP)


(same remark about the name 'multi-layer perceptron' as before)

Feedforward, full connectivity between consecutive layers: vanilla MLP for regression

(figure: input layer, s hidden layers with rᵢ neurons per layer i, and a single linear output unit producing y)
input layer s hidden layers (ri neurons per layer i) output layer

feed forward: evaluate layer per layer in a forward pass


Outline

• Connectionist machine learning: principles & historical landmarks

• Rosenblatt’s perceptron algorithm

• Connectionist view on perceptron, multi-layer perceptron (MLP)

• Approximation power of MLP

• Training of MLP, gradient back-propagation in neural networks

• Regularization of neural networks

• Convolutional neural networks (CNN)

Approximation power
The 1-hidden-layer case:

  y(x) = Σⱼ₌₁ʳ  γⱼ / ( 1 + exp( β₀ʲ − xᵀβʲ ) )

(figure: a single hidden layer of r sigmoid units with parameters (β₀ʲ, βʲ), combined linearly with weights γ₁, ···, γ_r into the output y)

Thm (Universal Approximation): [Cybenko 1989]
(in other words, this result means that the functions y(x) produced by 1-hidden-layer networks can approximate any such f uniformly on X)

For any continuous function f : ℝᵈ → ℝ with
compact support X, and any ε > 0, there exist
a hidden layer size r ∈ N and parameter values
β0j , β j , γj for 1 ≤ j ≤ r such that

|f (x) − y(x)| ≤ ε ∀x ∈ X.
Outline

• Connectionist machine learning: principles & historical landmarks

• Rosenblatt’s perceptron algorithm

• Connectionist view on perceptron, multi-layer perceptron (MLP)

• Approximation power of MLP

• Training of MLP, gradient back-propagation in neural networks

• Regularization of neural networks

• Convolutional neural networks (CNN)

Training
Input: (x₁, y₁), ···, (xₙ, yₙ)

  y(x) = Σⱼ₌₁ʳ γⱼ σⱼ( xᵀβʲ − β₀ʲ )
(here we consider general activation functions: an arbitrary activation function σⱼ per hidden unit)

(figure: 1-hidden-layer network with activations σ₁, ···, σ_r, biases β₀¹, ···, β₀ʳ and output weights γ₁, ···, γ_r)
Training
Input: (x₁, y₁), ···, (xₙ, yₙ)

  y(x) = γᵀ [ σ₁(xᵀβ¹ − β₀¹), ···, σ_r(xᵀβʳ − β₀ʳ) ]ᵀ
(we express the sum as a dot product)

  RSS = Σᵢ₌₁ⁿ (yᵢ − y(xᵢ))²    (Rᵢ is the contribution of the i-th observation)

  ∇_γ Rᵢ = −2 (yᵢ − y(xᵢ)) [ σ₁(xᵢᵀβ¹ − β₀¹), ···, σ_r(xᵢᵀβʳ − β₀ʳ) ]ᵀ
(the gradient's formula has the same form at each neuron: ∇ = error · impulse from the previous layer; the hidden layer plays the role of the previous layer of the output neuron)

  ∇_{βʲ} Rᵢ = −2 (yᵢ − y(xᵢ)) γⱼ σⱼ′( xᵢᵀβʲ − β₀ʲ ) xᵢ
Training
Input: (x₁, y₁), ···, (xₙ, yₙ)

  y(x) = γᵀ [ σ₁(xᵀβ¹ − β₀¹), ···, σ_r(xᵀβʳ − β₀ʳ) ]ᵀ,    RSS = Σᵢ₌₁ⁿ (yᵢ − y(xᵢ))²

  ∇_γ Rᵢ = −2 (yᵢ − y(xᵢ)) [ σ₁(xᵢᵀβ¹ − β₀¹), ···, σ_r(xᵢᵀβʳ − β₀ʳ) ]ᵀ    (weight of the output neuron's inputs = 1, id′ = 1)

  ∇_{βʲ} Rᵢ = −2 (yᵢ − y(xᵢ)) γⱼ σⱼ′( xᵢᵀβʲ − β₀ʲ ) xᵢ
(the error at a hidden neuron is computed by back-propagating the error from the next layer, weighted by the weight γⱼ of the j-th hidden neuron in the output neuron; the back-propagated error is defined up to a minus sign, because β₀ is subtracted in our convention)
Training
Input: (x₁, y₁), ···, (xₙ, yₙ)

gradient at each neuron:   ∇_β Rᵢ = err · z

back-propagation equation:
  err = ( Σⱼ₌₁ˢ γⱼ · errⱼ ) σ′( zᵀβ − β₀ )
(in the general feed-forward case we obtain this equation, which expresses the error at a neuron in terms of the errors err₁, ···, err_s of the neurons of the next layer, weighted by the connection weights γ₁, ···, γ_s; z₁, ···, z_r are the impulses coming from the previous layer)

Forward-backward procedure for each (xᵢ, yᵢ):
  ▶ forward: compute activations & impulses
  ▶ backward: back-propagate error and update
      β ← β − ϱ ∇_β Rᵢ,    β₀ ← β₀ − ϱ ∇_{β₀} Rᵢ    (ϱ = learning rate)
      (the equation for ∇_{β₀} Rᵢ was not given on the previous slides but is analogous)

Training epoch (stoch. gradient pass):
(the 'stochastic' term comes from the fact that the order in which the training set is swept can be randomized)
  ▶ sweep through the training set
  ▶ apply forward-backward update for each (xᵢ, yᵢ)
Guarantee: computes the correct gradient step   (Attention: this is under the assumption that the set …)

Training
Input: (x₁, y₁), ···, (xₙ, yₙ)
(same back-propagation equation and forward-backward procedure as on the previous slide)

Online learning:
  ▶ perform multiple training epochs
  ▶ update (reduce) ϱ between epochs
  until convergence or early stop
Guarantee: converges to a local min. under a proper reduction scheme for ϱ
reduction scheme for ϱ
Training
Input: (x₁, y₁), ···, (xₙ, yₙ)
(same back-propagation equation and forward-backward procedure as before)

Online learning:
  ▶ scales up well

▶ can handle new training data
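To make the forward-backward procedure concrete, here is a minimal C++/Eigen sketch of one stochastic update for a 1-hidden-layer network with sigmoid hidden units and a linear output, trained on the squared error; the struct layout is an assumption, and biases are added rather than subtracted (a sign convention that differs from the slides).

#include <Eigen/Dense>
#include <cmath>

struct MLP {
    Eigen::MatrixXd W;      // r x d hidden weights (row j plays the role of beta^j)
    Eigen::VectorXd b;      // r hidden biases
    Eigen::VectorXd gamma;  // r output weights
    double c = 0.0;         // output bias
};

static Eigen::VectorXd sigmoid(const Eigen::VectorXd& t) {
    return (1.0 / (1.0 + (-t.array()).exp())).matrix();
}

// One stochastic gradient step on a single example (x, y) with learning rate rho.
void sgd_step(MLP& net, const Eigen::VectorXd& x, double y, double rho) {
    // Forward pass: activations and impulses, then output.
    Eigen::VectorXd a = net.W * x + net.b;        // pre-activations
    Eigen::VectorXd z = sigmoid(a);               // hidden impulses
    double out = net.gamma.dot(z) + net.c;        // network output y(x)
    // Backward pass: error at the output, then back-propagated errors.
    double err = -2.0 * (y - out);                // derivative of (y - out)^2 w.r.t. out
    Eigen::VectorXd err_hidden =
        (err * net.gamma.array() * z.array() * (1.0 - z.array())).matrix();  // uses sigma' = sigma(1-sigma)
    // Gradient updates (weight decay would add an extra 2*lambda*weight term here).
    net.gamma -= rho * err * z;
    net.c     -= rho * err;
    net.W     -= rho * err_hidden * x.transpose();   // one outer product per hidden unit
    net.b     -= rho * err_hidden;
}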


Outline

• Connectionist machine learning: principles & historical landmarks

• Rosenblatt’s perceptron algorithm

• Connectionist view on perceptron, multi-layer perceptron (MLP)

• Approximation power of MLP

• Training of MLP, gradient back-propagation in neural networks

• Regularization of neural networks

• Convolutional neural networks (CNN)

Regularization
The high number of parameters in neural networks usually leads to overfitting.
[plot: training error and test error as functions of the number of training epochs]
Regularization
The high number of parameters in neural networks usually leads to overfitting.

[plot: training error and test error as functions of the number of training epochs]

Approach 1: early stop:

▶ stop gradient descent after k epochs


▶ select k by cross-validation

Regularization
The high number of parameters in neural networks usually leads to overfitting.

[diagram: multi-layer network with a random subset of neurons switched off in each layer]

Approach 2: dropout:

▶ at each training epoch, randomly switch off a fraction of neurons in each layer
  (the neurons are indeed completely switched off: their activations and gradients are set to 0)
▶ replaces the full model by a series of random simplified models
▶ select fraction of switched-off neurons by cross-validation
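As an illustration of the mechanism, here is a minimal C++ sketch: one pass of dropout amounts to zeroing each activation of a layer independently with probability p, so that the corresponding neurons contribute neither to the output nor to the gradients. The vector layout and the parameter p are assumptions of this sketch.

// Minimal sketch: switch off a fraction p of the neurons of one layer for the current pass
#include <random>
#include <vector>

void apply_dropout(std::vector<double>& activations, double p, std::mt19937& gen) {
    std::bernoulli_distribution drop(p);   // true with probability p
    for (double& a : activations)
        if (drop(gen)) a = 0.0;            // neuron switched off: no impulse, no gradient
}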
Regularization
The high number of parameters in neural networks usually leads to overfitting.

[diagram: multi-layer network]

Approach 3: weight decay:

▶ penalize the ℓ2-norm of the parameter vector:   RSS + λ Σ_{neuron j} ( (β0^j)² + ∥β^j∥₂² )
  (this is pretty much Ridge regularization in the neural network context; it favors small weights)
▶ adds a term 2λ β j to the gradient ∇β j Ri
▶ λ selected by cross-validation
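In code, weight decay is a one-line change to the gradient used in each SGD step; a minimal C++ sketch follows (the vector layout is an assumption, and the bias term is handled the same way).

// Minimal sketch: add the weight-decay term 2*lambda*beta^j to the gradient of R_i for hidden unit j
#include <cstddef>
#include <vector>

void add_weight_decay(std::vector<double>& grad_beta_j,
                      const std::vector<double>& beta_j, double lambda) {
    for (std::size_t k = 0; k < beta_j.size(); ++k)
        grad_beta_j[k] += 2.0 * lambda * beta_j[k];
}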

Regularization
The high number of parameters in neural networks usually leads to overfitting.

[diagram: multi-layer network]

Approach 3': weight elimination:

▶ penalize the ℓ2-norm of the parameter vector:   RSS + λ Σ_{neuron j} [ (β0^j)² + ∥β^j∥₂² ] / [ 1 + (β0^j)² + ∥β^j∥₂² ]
  (this penalty encourages a few large weights: once a weight becomes large enough, its penalty saturates)
▶ shrinks smaller weights more drastically
▶ λ selected by cross-validation
Regularization
Example: n = 100 + 100 (mixture of 2 Gaussians), d = 2

(as in the previous lectures, the decision boundaries are displayed; the hidden units are located in a single hidden layer)

10 hidden units:                              err. rate ≈ 25.9%
10 hidden units + weight decay (λ = 0.02):    err. rate ≈ 22.3%

As a reminder, we recall here the error rates of predictors from previous lectures:

predictor     Bayes classif.   7-NN    Gauss. logistic reg.   Gauss. ker SVM
error rate    21%              22.5%   22.1%                  21.8%


Outline

• Connectionist machine learning: principles & historical landmarks

• Rosenblatt’s perceptron algorithm

• Connectionist view on perceptron, multi-layer perceptron (MLP)

• Approximation power of MLP

• Training of MLP, gradient back-propagation in neural networks

• Regularization of neural networks

• Convolutional neural networks (CNN)

From MLP to convolutional networks


ZIP code dataset:

▶ 16 × 16 binary patches extracted from handwritten digits on envelopes in the US

▶ dataset down-sized to 320 images for training and 160 for testing
From MLP to convolutional networks
Proposed networks [Le Cun 1989]:
▶ Net-1: no hidden layer, full connectivity with sigmoid output units

(the size of the output layer, 1 × 10, is prescribed by the number of classes: the 10 digits)

[diagram: 16 × 16 input layer fully connected to the 1 × 10 output layer]

From MLP to convolutional networks


Proposed networks [Le Cun 1989]:
▶ Net-1: no hidden layer, full connectivity with sigmoid output units

▶ Net-2: 1 hidden layer with 12 sigmoid units, full connectivity

[diagram: 16 × 16 input, 1 × 12 hidden layer, 1 × 10 output, fully connected]
From MLP to convolutional networks
Proposed networks [Le Cun 1989]:
▶ Net-1: no hidden layer, full connectivity with sigmoid output units

▶ Net-2: 1 hidden layer with 12 sigmoid units, full connectivity

▶ Net-3: 2 hidden layers, local connectivity


• 3 × 3 patches for first hidden layer
• 5 × 5 patches for second hidden layer
• full connectivity at output layer

[diagram: 16 × 16 input, 8 × 8 hidden layer (3 × 3 patches), 4 × 4 hidden layer (5 × 5 patches), 1 × 10 output]

From MLP to convolutional networks


Proposed networks [Le Cun 1989]:
▶ Net-1: no hidden layer, full connectivity with sigmoid output units

▶ Net-2: 1 hidden layer with 12 sigmoid units, full connectivity

▶ Net-3: 2 hidden layers, local connectivity

▶ Net-4/5: 2 hidden layers, local connectivity, shared weights


(weights are shared within each frame: a convolutional layer; the name comes from the fact that applying the same local weights across the whole image amounts to a convolution)

[diagram: 16 × 16 input, 8 × 8 frames with shared weights, 4 × 4 hidden layer, 1 × 10 output]
From MLP to convolutional networks
Proposed networks [Le Cun 1989]:
▶ Net-1: no hidden layer, full connectivity with sigmoid output units

▶ Net-2: 1 hidden layer with 12 sigmoid units, full connectivity

▶ Net-3: 2 hidden layers, local connectivity

▶ Net-4/5: 2 hidden layers, local connectivity, shared weights


(weights are shared within each frame: a convolutional layer; the name comes from the fact that applying the same local weights across the whole image amounts to a convolution)

- convolution allows to reduce the global number of parameters of the network
- thus, it reduces the risk of overfitting

Test accuracy (reminder: accuracy = success rate):

network   accuracy
Net-1     80%
Net-2     87%
Net-3     88.5%
Net-4     94%
Net-5     98.4%

[plot: test accuracy vs. # training epochs for the five networks]
What you should know

• Concepts: artificial neuron & neural network, deep learning, AI winter

• Rosenblatt’s perceptron: linear model, stochastic gradient descent,


Rosenblatt-Novikoff thm, limitations

• Multi-layer perceptron (MLP): architecture, universal approximation thm

• Gradient back-propagation: general formula, special case of MLP

• Regularization: early stop, dropout, weight decay & elimination

• Convolutional neural network: principles


Feature Extraction

Outline:

• Principles of feature extraction

• Handcrafted features for:


- text data
- graph data
- image data
- 3d shapes data
- time series data

• Curse of dimensionality and dimensionality reduction

• Vector quantization

• Principal component analysis (PCA)


Outline

• Principles of feature extraction

• Handcrafted features for:


- text data
- graph data
- image data
- 3d shapes data
- time series data

• Curse of dimensionality and dimensionality reduction

• Vector quantization

• Principal component analysis (PCA)

Feature extraction in machine learning

[diagram: Data → (feature design or learning) → Features ∈ R^n]

Build features for:


▶ ML on non-vectorial data
▶ improved learning rates
Feature extraction in machine learning

Input: data space D (can be R^d, space of graphs, of 3d shapes, etc.)
(in previous lectures we talked about kernels for data sitting in R^d, e.g. the Gaussian kernel)

Goal: find a mapping Φ : D → Rk (vectorization) or Φ : D → H (kernel) that:

▶ is meaningful, preserving relationships (e.g. proximity) between data in D

▶ extracts structural information that is useful for learning tasks

▶ can be inverted (pre-image problem)

2 classes of approaches:

▶ feature engineering: new representation is designed using expert knowledge

▶ feature learning: new representation is learnt from the data

Feature extraction in machine learning

Input: data space D (can be R^d, space of graphs, of 3d shapes, etc.)
(in previous lectures we talked about kernels for data sitting in R^d, e.g. the Gaussian kernel)

Goal: find a mapping Φ : D → Rk (vectorization) or Φ : D → H (kernel) that:

▶ is meaningful, preserving relationships (e.g. proximity) between data in D

▶ extracts structural information that is useful for learning tasks

▶ can be inverted (pre-image problem)

An old, vast, rich, and (still) hot topic:

▶ area-specific

▶ deserves an entire course

▶ overview: 1-2 approaches per data type (cf. specialized 3A courses)
  (this overview is intended to help students get started for their upcoming projects or internships)
Outline

• Principles of feature extraction

• Handcrafted features for:


- text data
- graph data
- image data
- 3d shapes data
- time series data

• Curse of dimensionality and dimensionality reduction

• Vector quantization

• Principal component analysis (PCA)

Text features

Input: text data on a fixed dictionary T (e.g. English, French, scientific)

Bag-of-words model:

▶ distribution (histogram) X ∈ NT of word occurrences in the text

▶ normalized distribution X ′ ∈ RT to handle texts of different sizes

▶ ”bag” means that the order of words is ignored

▶ remove stop-words (these unbalance the distribution)

▶ use base forms of words (lemmatization)

Example:
”Humans come down from the apes, the ape comes down from the tree.”

ape: 1   apes: 1   come: 1   comes: 1   down: 2   from: 2   humans: 1   the: 3   tree: 1
(with lemmatization, apes and ape merge into ape: 2, and comes and come into come: 2)
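A minimal C++ sketch of the raw word histogram on this example (tokenization by whitespace, lower-casing and stripping punctuation; no stop-word removal or lemmatization):

// Minimal sketch: bag-of-words histogram with std::map
#include <cctype>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main() {
    std::string text = "Humans come down from the apes, the ape comes down from the tree.";
    std::map<std::string, int> histogram;
    std::istringstream iss(text);
    std::string word;
    while (iss >> word) {
        std::string clean;
        for (char c : word)                                  // keep letters only, lower-cased
            if (std::isalpha(static_cast<unsigned char>(c)))
                clean += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        if (!clean.empty()) ++histogram[clean];
    }
    for (const auto& [w, count] : histogram)                 // prints ape: 1, apes: 1, ..., the: 3
        std::cout << w << ": " << count << "\n";
}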
Text features

Input: text data on a fixed dictionary T (e.g. English, French, scientific)

Bag-of-words model:

▶ distribution (histogram) X ∈ NT of word occurrences in the text

▶ normalized distribution X ′ ∈ RT to handle texts of different sizes

▶ ”bag” means that the order of words is ignored

▶ remove stop-words (these unbalance the distribution)

▶ use base forms of words (lemmatization)

Typical application: spam filtering (email ≡ bag of keywords)

Generalization: n-gram model (sequences of words of length n)

Complement: word2vec (neural net trained to predict a word from its context; the context here means the neighboring words in the input text, and in fact word2vec uses a ”continuous BoW” representation of it)
Outline

• Principles of feature extraction

• Handcrafted features for:


- text data
- graph data
- image data
- 3d shapes data
- time series data

• Curse of dimensionality and dimensionality reduction

• Vector quantization

• Principal component analysis (PCA)

Graph features

Input: graph (network) data, represented via adjacency matrices A ∈ Rn×n

▶ representation not invariant to vertex relabeling
  (unweighted: Aij ∈ {0, 1}; undirected: A = A^T)

Example: (n = 5)   graph on vertices 1, . . . , 5 with edges {1,2}, {1,3}, {2,3}, {2,4}, {4,5}

      A = [ 0 1 1 0 0
            1 0 1 1 0
            1 1 0 0 0
            0 1 0 0 1
            0 0 0 1 0 ]
Graph features

Input: graph (network) data, represented via adjacency matrices A ∈ Rn×n

Graphlets:

▶ dictionary T of unlabeled graphs of fixed vertex size k (graphlets)

▶ count the number of occurrences of each graphlet as an induced subgraph
  (important: each occurrence must be an induced subgraph, i.e. the subgraph induced by its vertex set)

▶ spectrum X ∈ NT : distribution of occurrences (bag-of-features model)

Example: (n = 5, k = 3)   same 5-vertex graph and adjacency matrix A as above

X = (1, 3, 6, 0)
  triangle: 1 ({1, 2, 3})     path: 3 ({1, 2, 4}, {2, 3, 4}, {2, 4, 5})     single edge: 6     empty: 0

Principle: we look at all size-k subsets of vertices (and not tuples, since we want to be labeling-independent).

Graph features

Input: graph (network) data, represented via adjacency matrices A ∈ Rn×n

Graphlets:

▶ dictionary T of unlabeled graphs of fixed vertex size k (graphlets)

▶ count the number of occurrences of each graphlet as an induced subgraph
  (important: each occurrence must be an induced subgraph, i.e. the subgraph induced by its vertex set)

▶ spectrum X ∈ NT : distribution of occurrences (bag-of-features model)

▶ normalized spectrum X ′ ∈ RT to handle graphs of different vertex sizes n

Example: (n = 5, k = 3)   same 5-vertex graph and adjacency matrix A as above

X′ = (1/10, 3/10, 6/10, 0)

Principle: we look at all size-k subsets of vertices (and not tuples, since we want to be labeling-independent).
Graph features

Input: graph (network) data, represented via adjacency matrices A ∈ Rn×n

Graphlets:

▶ dictionary T of unlabeled graphs of fixed vertex size k (graphlets)

▶ count the number of occurrences of each graphlet as an induced subgraph
  (important: each occurrence must be an induced subgraph, i.e. the subgraph induced by its vertex set)

▶ spectrum X ∈ NT : distribution of occurrences (bag-of-features model)

▶ normalized spectrum X ′ ∈ RT to handle graphs of different vertex sizes n

Props:

▶ G1 ≃ G2 ⇒ X ′ (G1 ) = X ′ (G2 )

▶ the converse holds for n = k + 1 ≤ 11 but not in general
  (in general, we do lose some information in the process)

Graph features

Input: graph (network) data, represented via adjacency matrices A ∈ Rn×n

Graphlets:

▶ dictionary T of unlabeled graphs of fixed vertex size k (graphlets)

▶ count the number of occurrences of each graphlet as an induced subgraph
  (important: each occurrence must be an induced subgraph, i.e. the subgraph induced by its vertex set)

▶ spectrum X ∈ NT : distribution of occurrences (bag-of-features model)

▶ normalized spectrum X ′ ∈ RT to handle graphs of different vertex sizes n

Computation:

▶ exhaustive enumeration of size-k subgraphs in size-n graph takes O(nk ) time

▶ sample the space of graphlets to speed up the calculations in practice
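A minimal C++ sketch of the exhaustive O(n^k) enumeration for k = 3 on the example graph above; for k = 3 the isomorphism type of an induced subgraph is determined by its number of edges, which keeps the bookkeeping trivial.

// Minimal sketch: graphlet spectrum for k = 3 by exhaustive enumeration of vertex triples
#include <array>
#include <iostream>
#include <vector>

int main() {
    // adjacency matrix of the 5-vertex example graph from the slides
    std::vector<std::vector<int>> A = {
        {0, 1, 1, 0, 0},
        {1, 0, 1, 1, 0},
        {1, 1, 0, 0, 0},
        {0, 1, 0, 0, 1},
        {0, 0, 0, 1, 0}};
    int n = (int)A.size();
    std::array<int, 4> spectrum = {0, 0, 0, 0};   // indexed by number of induced edges
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            for (int k = j + 1; k < n; ++k)
                ++spectrum[A[i][j] + A[i][k] + A[j][k]];
    // expected output on this example: triangle: 1, path: 3, edge: 6, empty: 0, i.e. X = (1, 3, 6, 0)
    std::cout << "triangle: " << spectrum[3] << ", path: " << spectrum[2]
              << ", edge: " << spectrum[1] << ", empty: " << spectrum[0] << "\n";
}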


Outline

• Principles of feature extraction

• Handcrafted features for:


- text data
- graph data
- image data
- 3d shapes data
- time series data

• Curse of dimensionality and dimensionality reduction

• Vector quantization

• Principal component analysis (PCA)

Image features

Input: Images via intensity maps I : Z2 → R, one for each color channel

▶ representation not invariant to rigid transforms and rescaling

▶ local features designed to characterize the ‘local shape’ of the image

▶ used for salient point detection & matching


(e.g. stereoscopic vision, panorama building)

▶ can be combined into global


features for image comparison
Image features

Input: Images via intensity maps I : Z2 → R, one for each color channel

SIFT (Scale-Invariant Feature Transform) at pixel x ∈ Z2 :

a) choose the scale σ:   (this step makes the feature scale-invariant, as the name says)

▶ compute convolutions at x with Gaussian kernels of various bandwidths σi
  (for instance with bandwidths ranging over a logarithmic scale)
▶ compute the differences between the convolutions at bandwidths σi, σi+1
  (these differences approximate the gradient of the convolution w.r.t. the scale parameter σ; the difference is integrated over the mask's domain)
▶ select scale(s) σ with maximum difference

[diagram: Gaussian masks of increasing scale centered at pixel x]

Image features

Input: Images via intensity maps I : Z2 → R, one for each color channel

SIFT (Scale-Invariant Feature Transform) at pixel x ∈ Z2 :

b) choose the orientation:   (this step makes the feature rotation-invariant, a desirable property as well)
▶ compute intensity gradient at each pixel y in a window of size ∝ σ around x
▶ build histogram of gradient directions (36 bins, 10 degrees each)
▶ assign orientation corresponding to highest peak in histogram
▶ rotate image so assigned orientation is vertical
[diagram: window around pixel x, with the intensity gradient computed at each pixel y]
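As an illustration of step b), here is a minimal C++ sketch of the 36-bin orientation histogram; the intensity-map layout, the window representation and the magnitude weighting of the votes are assumptions of this sketch, not part of the description above.

// Minimal sketch: dominant gradient orientation (36 bins of 10 degrees) in a window
#include <cmath>
#include <utility>
#include <vector>

// I: intensity map (row-major grid); window: list of (row, col) pixels around x,
// assumed to stay at distance >= 1 from the image border
int dominant_orientation_bin(const std::vector<std::vector<double>>& I,
                             const std::vector<std::pair<int, int>>& window) {
    const double PI = std::acos(-1.0);
    std::vector<double> hist(36, 0.0);
    for (const auto& pixel : window) {
        int r = pixel.first, c = pixel.second;
        double gx = I[r][c + 1] - I[r][c - 1];      // finite-difference intensity gradient
        double gy = I[r + 1][c] - I[r - 1][c];
        double angle = std::atan2(gy, gx);          // in (-pi, pi]
        int bin = (int)((angle + PI) / (2.0 * PI) * 36.0);
        if (bin >= 36) bin = 35;                    // guard against angle == pi
        hist[bin] += std::sqrt(gx * gx + gy * gy);  // votes weighted by gradient magnitude (assumption)
    }
    int best = 0;
    for (int b = 1; b < 36; ++b)
        if (hist[b] > hist[best]) best = b;
    return best;                                    // bin index of the assigned orientation
}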
Image features

Input: Images via intensity maps I : Z2 → R, one for each color channel

SIFT (Scale-Invariant Feature Transform) at pixel x ∈ Z2 :

c) compute feature:
▶ subdivide a 16 × 16 window around x into 16 patches of size 4 × 4
  (the size of the window is fixed a priori, i.e. independent of the scale σ)
▶ compute histogram of gradient orientations (8 bins) in each 4 × 4 patch
▶ collect the 8 × 4 × 4 = 128 values (weighted by Gaussian at x) into a vector

Image features

Input: Images via intensity maps I : Z2 → R, one for each color channel

SIFT (Scale-Invariant Feature Transform) at pixel x ∈ Z2 :

c) compute feature:
▶ subdivide a 16 × 16 window around x into 16 patches of size 4 × 4
  (the size of the window is fixed a priori, i.e. independent of the scale σ)
▶ compute histogram of gradient orientations (8 bins) in each 4 × 4 patch
▶ collect the 8 × 4 × 4 = 128 values (weighted by Gaussian at x) into a vector

From local to global features:


▶ sample the image domain uniformly
▶ compute SIFT feature at each sample
▶ concatenate feature vectors or collect them in a set (more on this later)
  (this leads to a potentially very high-dimensional global feature vector)
Outline

• Principles of feature extraction

• Handcrafted features for:


- text data
- graph data
- image data
- 3d shapes data
- time series data

• Curse of dimensionality and dimensionality reduction

• Vector quantization

• Principal component analysis (PCA)

3d shape features

Input: 3d shapes via (triangular) meshes

▶ non-canonical representation

▶ local features designed to characterize the shape’s local geometry

▶ used for salient point detection

▶ can be combined into global features


for e.g. shape comparison and matching
3d shape features

Input: 3d shapes via (triangular) meshes

Spin image at vertex p on a triangulated shape X:

▶ estimate the tangent plane P and oriented unit normal ⃗n at p


▶ compute cylindric coordinates (r, θ, h) for every other vertex of X
▶ keep only those vertices with small distance & normal deviation from p
▶ record histogram of the distribution of radii & heights (r, h)
(invariant to spin)

[diagram: cylindrical coordinates (r, θ, h) of a vertex of X around the normal axis at p]
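A minimal C++ sketch of the (r, h) histogram; the bin layout and the cut-off radii are assumptions, and the filtering by normal deviation mentioned above is omitted for brevity.

// Minimal sketch: spin image at point p with unit normal n, as a 2d histogram of (r, h)
#include <algorithm>
#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

std::vector<std::vector<int>> spin_image(const Vec3& p, const Vec3& n,
                                         const std::vector<Vec3>& vertices,
                                         int bins, double r_max, double h_max) {
    std::vector<std::vector<int>> hist(bins, std::vector<int>(bins, 0));
    for (const Vec3& v : vertices) {
        Vec3 d = {v[0] - p[0], v[1] - p[1], v[2] - p[2]};
        double h = d[0] * n[0] + d[1] * n[1] + d[2] * n[2];            // signed height along n
        double sq = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] - h * h;
        double r = std::sqrt(std::max(0.0, sq));                       // distance to the axis (p, n)
        if (r < r_max && std::abs(h) < h_max) {                        // keep only nearby vertices
            int ri = (int)(r / r_max * bins);
            int hi = (int)((h + h_max) / (2.0 * h_max) * bins);
            ++hist[std::min(ri, bins - 1)][std::min(hi, bins - 1)];
        }
    }
    return hist;                                                       // invariant to the spin angle theta
}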

3d shape features

Input: 3d shapes via (triangular) meshes

Spin image at vertex p on a triangulated shape X:

▶ estimate the tangent plane P and oriented unit normal ⃗n at p


▶ compute cylindric coordinates (r, θ, h) for every other vertex of X
▶ keep only those vertices with small distance & normal deviation from p
▶ record histogram of the distribution of radii & heights (r, h)
(invariant to spin)

From local to global features:


▶ sample the shape uniformly
▶ compute spin image at each sample
▶ collect the spin images in a set (more on this later)
Outline

• Principles of feature extraction

• Handcrafted features for:


- text data
- graph data
- image data
- 3d shapes data
- time series data

• Curse of dimensionality and dimensionality reduction

• Vector quantization

• Principal component analysis (PCA)

Time series features

Input: time series f : N → Rd

▶ may be chaotic, irregularly sampled, multivariate, hard to realign, etc.
  (chaotic behavior in this context comes from incommensurate frequencies; classical features have been proposed, notably for periodic time series, such as coefficients of Fourier or wavelet transforms, but irregular sampling makes them harder to use)
Time series features

Input: time series f : N → Rd

Time-delay embedding (a.k.a. sliding-window embedding):
(the formulas given here assume d = 1; for higher values of d, the vectors obtained for each coordinate are concatenated)

[diagram: a sliding window over f is mapped by TD_{m,τ} to a point of R^{m+1}]

    TD_{m,τ}(f, t) := [ f(t), f(t+τ), · · · , f(t+mτ) ]^T

    τ : step / delay        mτ : window size        m + 1 : embedding dimension

▶ point cloud in R^{m+1} (time is forgotten about)
  (the time series thus becomes a regular point cloud in Euclidean space, where each point corresponds to one observation window)
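A minimal C++ sketch of the construction for d = 1; the toy signal and parameter values are illustrative.

// Minimal sketch: time-delay embedding of a scalar time series into R^{m+1}
#include <cmath>
#include <iostream>
#include <vector>

std::vector<std::vector<double>> time_delay_embedding(
        const std::vector<double>& f, int m, int tau) {
    std::vector<std::vector<double>> cloud;
    for (int t = 0; t + m * tau < (int)f.size(); ++t) {   // one point per window position
        std::vector<double> point(m + 1);
        for (int j = 0; j <= m; ++j) point[j] = f[t + j * tau];
        cloud.push_back(point);
    }
    return cloud;                                          // time index is forgotten
}

int main() {
    std::vector<double> f;
    for (int t = 0; t < 200; ++t) f.push_back(std::sin(0.1 * t));  // periodic toy signal
    int m = 2, tau = 5;
    auto cloud = time_delay_embedding(f, m, tau);
    std::cout << cloud.size() << " points in R^" << m + 1 << "\n";
    // for a periodic signal, the embedded point cloud traces a closed curve (circularity)
}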

Time series features

Input: time series f : N → Rd

Time-delay embedding (a.k.a. sliding-window embedding):
(the formulas given here assume d = 1; for higher values of d, the vectors obtained for each coordinate are concatenated)

[diagram: a sliding window over f is mapped by TD_{m,τ} to a point of R^{m+1}]

signal                         embedded data
periodicity                    circularity
max. frequency (ν)             min. ambient dimension (m ≥ 2ν)
# non-commensurate freq.       intrinsic dimension (S^1 × · · · × S^1)

(the condition m ≥ 2ν comes, remotely, from Shannon's sampling theorem; the intuition is that the larger the frequency, the larger the ambient dimension required)
Time series features

Input: time series f : N → Rd

Time-delay embedding (a.k.a. sliding-window embedding):
(the formulas given here assume d = 1; for higher values of d, the vectors obtained for each coordinate are concatenated)

[diagram: a sliding window over f is mapped by TD_{m,τ} to a point of R^{m+1}]

Motivation from dynamical systems:

Thm: [Nash, Takens] Given a Riemannian manifold X of dimension m/2, it is a generic property of ϕ ∈ Diff²(X) and α ∈ C²(X, R) that the map

    X → R^{m+1},    x ↦ ( α(x), α ◦ ϕ(x), · · · , α ◦ ϕ^m(x) )

is an embedding.

(the observed time series is f_x(n) := α(ϕ^n(x)); this implies for instance that any periodic orbit in (X, ϕ) is mapped to a periodic orbit, i.e. a closed curve, in R^{m+1})
Outline

• Principles of feature extraction

• Handcrafted features for:


- text data
- graph data
- image data
- 3d shapes data
- time series data

• Curse of dimensionality and dimensionality reduction

• Vector quantization

• Principal component analysis (PCA)

Curse of dimensionality

Most features live in high to very high dimensions:


▶ size of dictionary for text
▶ number of graphlets of a given size for graphs
▶ 128 for SIFT features, multiplied by # sampled pixels
▶ area of neighborhood for spin images, multiplied by # sampled vertices
▶ sliding window size over stepsize for time series

Curse of dimensionality: set of phenomena occurring in high dimensions,


which are detrimental to the analysis of data.

▶ poor algorithmic performance of NN-search algorithms


▶ poor convergence rates of nonparametric density estimators
▶ poor learning rates of k-NN predictor
▶ etc.
Curse of dimensionality

Most features live in high to very high dimensions:


▶ size of dictionary for text
▶ number of graphlets of a given size for graphs
▶ 128 for SIFT features, multiplied by # sampled pixels
▶ area of neighborhood for spin images, multiplied by # sampled vertices
▶ sliding window size over stepsize for time series

Curse of dimensionality: set of phenomena occurring in high dimensions,


which are detrimental to the analysis of data.

▶ workaround: dimensionality reduction

Dimensionality reduction

Hypothesis: data can be described by a small set of intrinsic (latent) variables.

[diagram: data in R^d mapped down to latent coordinates in R^k]

Example: set of 4096-dimensional data points, representing pixel images (4096 = 64 × 64 pixels) of a same object under various lighting angles (from Isomap, Science 2000)
Dimensionality reduction

Hypothesis: data can be described by a small set of intrinsic (latent) variables.

Contributions of dimensionality reduction:

▶ prevents the curse of dimensionality

▶ identifies the latent variables

▶ resolves degenerate cases (cf. linear regression with dependent variables)

Dimensionality reduction

Hypothesis: data can be described by a small set of intrinsic (latent) variables.

A wealth of approaches (this is a very old and rich topic, so only a few popular ones are listed here):

▶ linear (PCA, MDS, NMF: non-negative matrix factorization)

▶ non-linear (kernel-PCA, Isomap, t-SNE, UMAP)

▶ clustering based (vector quantization)

▶ neural network based (autoencoders)


Outline

• Principles of feature extraction

• Handcrafted features for:


- text data
- graph data
- image data
- 3d shapes data
- time series data

• Curse of dimensionality and dimensionality reduction

• Vector quantization

• Principal component analysis (PCA)

Vector quantization

Hypothesis: data gather into a small number of clusters.

▶ compute clusters C1, · · · , Ck and cluster centers c1, · · · , ck

▶ compute coordinates d1(xi), · · · , dk(xi) w.r.t. cluster centers for each data point xi

▶ map data points to R^k via x ↦ [ d1(x) · · · dk(x) ]^T

Computing the Cj's, cj's and dj(xi)'s:

▶ k-means: dj(xi) = 1_{xi ∈ Cj}

▶ Gaussian mixture models: for each xi, a probability distribution on the Cj's
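A minimal C++ sketch of the k-means (hard assignment) case, given precomputed cluster centers; the data layout is an assumption of this sketch.

// Minimal sketch: hard vector quantization, x -> one-hot code [d_1(x) ... d_k(x)]^T
#include <cstddef>
#include <limits>
#include <vector>

std::vector<double> quantize(const std::vector<double>& x,
                             const std::vector<std::vector<double>>& centers) {
    std::size_t best = 0;
    double best_dist = std::numeric_limits<double>::max();
    for (std::size_t j = 0; j < centers.size(); ++j) {      // nearest cluster center
        double dist = 0.0;
        for (std::size_t k = 0; k < x.size(); ++k) {
            double diff = x[k] - centers[j][k];
            dist += diff * diff;
        }
        if (dist < best_dist) { best_dist = dist; best = j; }
    }
    std::vector<double> code(centers.size(), 0.0);
    code[best] = 1.0;                                        // d_j(x) = 1 iff c_j is the nearest center
    return code;
}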
Vector quantization

Hypothesis: data gather into a small number of clusters.

▶ compute clusters C1, · · · , Ck and cluster centers c1, · · · , ck  (the codebook)

▶ compute coordinates d1(xi), · · · , dk(xi) w.r.t. cluster centers for each data point xi  (its code)

▶ map data points to R^k via x ↦ [ d1(x) · · · dk(x) ]^T

Applications in feature extraction:
(here each datum x is one feature extracted from one observation; when a single feature is extracted from an observation, this feature can be mapped to the vector of its coordinates w.r.t. the codebook; one can also do something more subtle when several different features are extracted from the same observation)

▶ reduce dimensionality of feature space

▶ encode a set of features as a distribution over the codebook (bag-of-features)
  ▷ pooling of word functions dj(·)

Vector quantization

Example with image data (same idea applies to 3d shapes):

SIFT features  +  k-means quantization  +  sum-pooling (Σ) of the word functions dj(·)

(the word functions take their values in {0, 1}, so sum-pooling simply counts occurrences)

[diagram: histogram of # occurrences over the codewords]
Outline

• Principles of feature extraction

• Handcrafted features for:


- text data
- graph data
- image data
- 3d shapes data
- time series data

• Curse of dimensionality and dimensionality reduction

• Vector quantization

• Principal component analysis (PCA)

Linear dimensionality reduction

Hypothesis: data lie on (or close to) some k-dimensional affine subspace.
Input: X = [ x1^T ; · · · ; xn^T ] ∈ R^{n×d},   k ∈ N

Goal: find H minimizing the residual variance:

    H := argmin_{dim E = k} (1/n) Σ_{i=1}^n ∥ xi − πE(xi) ∥₂²        (πE : orthogonal projection onto E)

[figure: point cloud and its orthogonal projections onto a candidate subspace E]
Linear dimensionality reduction

Hypothesis: data lie on (or close to) some k-dimensional affine subspace.
Input: X = [ x1^T ; · · · ; xn^T ] ∈ R^{n×d},   k ∈ N

Goal: find H minimizing the residual variance:

    H := argmin_{dim E = k} (1/n) Σ_{i=1}^n ∥ xi − πE(xi) ∥₂²

Prop: H contains the centroid x̄ := (1/n) Σ_{i=1}^n xi

proof: by Pythagoras' theorem (xi − πE(xi) is orthogonal to E, hence to πE(x̄) − πE(xi)):

    (1/n) Σ_{i=1}^n ∥ xi − πE(xi) ∥₂²  =  (1/n) Σ_{i=1}^n ( ∥ xi − πE(x̄) ∥₂² − ∥ πE(x̄) − πE(xi) ∥₂² )

The subtracted term is invariant under translations of E. For the other term, develop the square and notice that the cross-term (dot product) vanishes thanks to the fact that Σ_{i=1}^n (xi − x̄) = 0:

    (1/n) Σ_{i=1}^n ∥ xi − πE(x̄) ∥₂²  =  (1/n) Σ_{i=1}^n ∥ xi − x̄ ∥₂²  +  ∥ x̄ − πE(x̄) ∥₂²

The first term does not depend on E, and the second one is minimized (equal to 0) when x̄ ∈ E. Hence translating E so that it contains x̄ can only decrease the residual variance, so the optimum H contains x̄.  □

Linear dimensionality reduction

Hypothesis: data lie on (or close to) some k-dimensional affine subspace.
Input: X = [ x1^T ; · · · ; xn^T ] ∈ R^{n×d},   k ∈ N

Goal: find H minimizing the residual variance:

    H := argmin_{dim E = k} (1/n) Σ_{i=1}^n ∥ xi − πE(xi) ∥₂²

Hyp: dataset is centered (x̄ = 0)        Data centering: ∀i, xi ↦ xi − x̄

    H :  { x^T β1 = 0,  · · · ,  x^T βd−k = 0 }        where (β1, · · · , βd−k) is orthonormal
Linear dimensionality reduction

Hypothesis: data lie on (or close to) some k-dimensional affine subspace.
Input: X = [ x1^T ; · · · ; xn^T ] ∈ R^{n×d},   k ∈ N

Goal: find H minimizing the residual variance:

    H := argmin_{dim E = k} (1/n) Σ_{i=1}^n ∥ xi − πE(xi) ∥₂²

Hyp: dataset is centered (x̄ = 0)        Data centering: ∀i, xi ↦ xi − x̄

    H :  x^T B = 0    where B = [ β1 · · · βd−k 0 · · · 0 ] ∈ R^{d×d}  s.t.  B^T B = [ I_{d−k}  0 ; 0  0 ]

(so, basically, B is orthogonal on its first d − k columns and zero on the remaining ones)

▶ ideally, B is such that X B = 0

▶ in practice, find B̂ minimizing ∥X B∥²_F = Σ_{i,j} (X B)²_{ij}
  (we search for a least-squares solution; the functional to optimize is equal, up to the factor 1/n, to the residual variance, since this Frobenius norm equals Σ_{i=1}^n ∥ xi − πH(xi) ∥₂²)

Resolution of least-squares problem


B̂ := argmin_B ∥X B∥²_F        (this is the same objective function as in linear regression, with the response vector equal to 0)

    where B = [ β1 · · · βd−k 0 · · · 0 ] ∈ R^{d×d}  s.t.  B^T B = [ I_{d−k}  0 ; 0  0 ]

▶ B expressed as W [ I_{d−k}  0 ; 0  0 ]  where  W^T W = I_d
  (a full-rank orthogonal matrix right-composed with an orthogonal projection)

▶ Singular value decomposition (SVD) of X:

    X = U^T D V    where  U^T U = I_n,   V^T V = I_d,   D ∈ R^{n×d} with diagonal entries 0 ≤ λ1 ≤ · · · ≤ λd and zeros elsewhere

    (here we assume wlog that n ≥ d; we can always reduce to this case)

⇒  ∥X B∥²_F = ∥ U^T D V W [ I_{d−k}  0 ; 0  0 ] ∥²_F   is minimum when W = V^T
   (indeed, this choice places the lowest values λ1, · · · , λd−k on the columns that are kept)
Resolution of least-squares problem
B̂ := argmin_B ∥X B∥²_F = V^T [ I_{d−k}  0 ; 0  0 ]

⇒  H = { x : x^T B̂ = 0 },   and the projections πH(x1), · · · , πH(xn) are encoded by the coordinate matrix X V^T [ 0  0 ; 0  I_k ]
   (this is the matrix of the coordinates of the projected data points in the basis given by V)

Geometric interpretation:

▶ V^T aligns the frame with the principal directions of the covariance matrix

▶ [ 0  0 ; 0  I_k ] projects onto the principal directions of largest eigenvalues (variances)

[figure: the point cloud, with principal direction β1, is rotated by V^T and then projected by [ 0  0 ; 0  I_k ]]
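As a small illustration of the pipeline (centering, then extracting principal directions), here is a minimal self-contained C++ sketch that recovers the leading principal direction by power iteration on the covariance matrix instead of a full SVD; the toy dataset and the iteration count are assumptions, not the course's reference implementation.

// Minimal sketch: centering + leading principal direction by power iteration
#include <cmath>
#include <iostream>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

int main() {
    Matrix X = {{2.5, 2.4}, {0.5, 0.7}, {2.2, 2.9}, {1.9, 2.2},
                {3.1, 3.0}, {2.3, 2.7}, {2.0, 1.6}, {1.0, 1.1}};   // toy data (n = 8, d = 2)
    int n = (int)X.size(), d = (int)X[0].size();

    // 1) data centering: x_i <- x_i - mean
    std::vector<double> mean(d, 0.0);
    for (const auto& x : X) for (int k = 0; k < d; ++k) mean[k] += x[k] / n;
    for (auto& x : X) for (int k = 0; k < d; ++k) x[k] -= mean[k];

    // 2) covariance matrix C = (1/n) X^T X
    Matrix C(d, std::vector<double>(d, 0.0));
    for (const auto& x : X)
        for (int a = 0; a < d; ++a)
            for (int b = 0; b < d; ++b) C[a][b] += x[a] * x[b] / n;

    // 3) power iteration: converges to the eigenvector of largest eigenvalue,
    //    i.e. the principal direction of largest variance
    std::vector<double> v(d, 1.0);
    for (int it = 0; it < 1000; ++it) {
        std::vector<double> w(d, 0.0);
        for (int a = 0; a < d; ++a)
            for (int b = 0; b < d; ++b) w[a] += C[a][b] * v[b];
        double norm = 0.0;
        for (double t : w) norm += t * t;
        norm = std::sqrt(norm);
        for (int a = 0; a < d; ++a) v[a] = w[a] / norm;
    }

    // 4) 1-dimensional embedding: coordinate of each centered point along v
    for (const auto& x : X) {
        double coord = 0.0;
        for (int k = 0; k < d; ++k) coord += x[k] * v[k];
        std::cout << coord << "\n";
    }
}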

Experimental result

Dataset: 1988 Olympics decathlon (10 variables, 34 observations)
(the variables corresponding to runtimes have been negated, so that performance increases with the value of every variable)
Experimental result

Dataset: 1988 Olympics decathlon (10 variables, 34 observations)


[figure: embedding into R^k with k = 2; the first intrinsic variable reflects global performance, and a closer inspection of the second reveals groups of runners vs. throwers]

[figure: spectrum of the SVD diagonal matrix D (ordered by decreasing value); it suggests that most of the variance should be explained by the first 2 intrinsic variables]
What you should know

• Concepts: feature extraction / engineering / learning

• Examples of features for text and graph data

• Curse of dimensionality and principles of dimensionality reduction

• Vector quantization

• PCA: linear model, least squares problem, resolution by SVD
