
MAT3601

Introduction to data analysis

Supervised classification
Bayes classifier and discriminant analysis

1/39
Machine Learning

2/39
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

3/39
Supervised Learning

Supervised Learning Framework


+ Input measurement X ∈ 𝒳 (often 𝒳 ⊂ R^d), output measurement Y ∈ 𝒴.
+ The joint distribution of (X, Y) is unknown.
+ Y ∈ {−1, 1} (classification) or Y ∈ R^m (regression).
+ A predictor is a measurable function in F = {f : 𝒳 → 𝒴}.

Training data
+ Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} i.i.d. with the same distribution as (X, Y ).

Goal
+ Construct a good predictor f̂n from the training data.
+ Need to specify the meaning of good.

4/39
Loss and Probabilistic Framework

Loss function
+ ℓ(Y, f(X)) measures the goodness of the prediction of Y by f(X).
+ Prediction loss: ℓ(Y, f(X)) = 1{Y ≠ f(X)}.
+ Quadratic loss: ℓ(Y, f(X)) = ‖Y − f(X)‖².

Risk function
+ Risk measured as the average loss:

R(f) = E[ℓ(Y, f(X))].

+ Prediction loss: E[ℓ(Y, f(X))] = P(Y ≠ f(X)).

+ Quadratic loss: E[ℓ(Y, f(X))] = E[‖Y − f(X)‖²].

+ Beware: as f̂n depends on Dn, R(f̂n) is a random variable!

5/39
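As a quick illustration of these two risks, here is a minimal R sketch (simulated data and a hand-picked predictor, both invented for the example) that computes their empirical counterparts on a sample:

    # Empirical risks of a fixed predictor f on a simulated sample.
    set.seed(1)
    n <- 200
    x <- rnorm(n)
    y <- ifelse(x + rnorm(n) > 0, 1, -1)   # labels in {-1, 1}

    f <- function(x) ifelse(x > 0, 1, -1)  # a hand-picked predictor

    risk_01 <- mean(y != f(x))             # empirical prediction (0-1) risk
    risk_l2 <- mean((y - f(x))^2)          # empirical quadratic risk
    c(risk_01 = risk_01, risk_l2 = risk_l2)

These averages estimate R(f) for a fixed f; the point of the remark above is that once f is itself fitted on Dn, its risk becomes random.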
A robot that learns

A robot endowed with a set of sensors and an online learning algorithm.

+ Task: play football.


+ Performance: score.
+ Experience: current environment and outcome, past games...

6/39
Object recognition in an image

+ Task: say if an object is present or not in the image.


+ Performance: number of errors.
+ Experience: set of previously seen labeled images.

7/39
Number

+ Task: read a ZIP code from an envelope.

+ Performance: number of reading errors.
+ Prediction problem with X: image and Y: corresponding number.

8/39
Applications in biology

+ Task: protein interaction network prediction.


+ Goal: predict (unknown) interactions between proteins.
+ Prediction problem with X: pair of proteins and Y: existence or not of an
interaction.

9/39
Detection

+ Goal: detect the position of faces in an image.


+ X: mask in the image and Y: presence or not of a face.

10/39
Classification

Setting
+ Historical data about individuals i = 1, . . . , n.
+ Features vector Xi ∈ Rd for each individual i.
+ For each i, the individual belongs to a group (Yi = 1) or not (Yi = −1).
+ Yi ∈ {−1, 1} is the label of i.

Aim
+ Given a new X (with no corresponding label), predict a label in {−1, 1}.
+ Use data Dn = {(x1 , y1 ), . . . , (xn , yn )} to construct a classifier.

11/39
Classification

Geometrically

Learn a boundary to separate two “groups” of points.

12/39
Classification

...many ways to separate points!

13/39
Supervised learning methods

Support Vector Machine

Linear Discriminant Analysis

Logistic Regression

Trees/ Random Forests

Kernel methods

Neural Networks

Many more...

14/39
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

15/39
Best Solution

The best solution f* (which is independent of Dn) is

f* = argmin_{f ∈ F} R(f) = argmin_{f ∈ F} E[ℓ(Y, f(X))].

Bayes Predictor (explicit solution)


+ Binary classification with 0–1 loss:

f*(X) = +1 if P(Y = 1|X) > P(Y = −1|X), i.e. if P(Y = 1|X) > 1/2,
        −1 otherwise.

+ Regression with the quadratic loss:

f*(X) = E[Y|X].

The explicit solution requires knowing the conditional law of Y given X (P(Y = 1|X) or E[Y|X])...

16/39
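A short justification of the classification case (a standard pointwise argument, not reproduced from the slides, written as a LaTeX display):

    \[
    R(f) = \mathbb{E}\big[\mathbb{P}(Y \neq f(X) \mid X)\big],
    \qquad
    \mathbb{P}(Y \neq f(X) \mid X) =
    \begin{cases}
    1 - \mathbb{P}(Y = 1 \mid X), & f(X) = +1, \\
    \mathbb{P}(Y = 1 \mid X), & f(X) = -1,
    \end{cases}
    \]

so the conditional error is minimized pointwise by predicting +1 exactly when P(Y = 1|X) > 1/2, which is the Bayes classifier f* above.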
Plugin Classifier

+ In many cases, the conditional law of Y given X is not known... or relies on
parameters to be estimated.
+ An empirical surrogate of the Bayes classifier is obtained from a (possibly
nonparametric) estimator η̂n(x) of

η(x) = P(Y = 1|X = x)

built using the training dataset.
+ This surrogate is then plugged into the Bayes classifier.

Plugin Bayes Classifier

+ Binary classification with 0–1 loss:

f̂n(X) = +1 if η̂n(X) > 1/2,
         −1 otherwise.

17/39
Plugin Classifier

Input: a data set Dn.

Learn the distribution of Y given X (using the data set) and plug this estimate
into the Bayes classifier.

Output: a classifier f̂n : R^d → {−1, 1},

f̂n(X) = +1 if η̂n(X) > 1/2,
         −1 otherwise.

+ Can we certify that the plug-in classifier is good?

18/39
Classification Risk Analysis

The misclassification error satisfies (see exercises):

0 ≤ P(f̂n(X) ≠ Y) − L* ≤ 2 E[|η(X) − η̂n(X)|²]^(1/2),

where
L* = P(f*(X) ≠ Y)

and η̂n(x) is an empirical estimate, based on the training dataset, of

η(x) = P(Y = 1|X = x).

19/39
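A sketch of the argument behind this bound (a standard one, presumably what the exercises ask for; η̂n is treated as fixed, i.e. we condition on Dn):

    \[
    \mathbb{P}(\hat f_n(X) \neq Y \mid X) - \mathbb{P}(f^*(X) \neq Y \mid X)
    = |2\eta(X) - 1|\,\mathbf{1}\{\hat f_n(X) \neq f^*(X)\}
    \le 2\,|\hat\eta_n(X) - \eta(X)|,
    \]

since f̂n(X) ≠ f*(X) forces η̂n(X) and η(X) to lie on opposite sides of 1/2. Taking expectations and applying the Cauchy–Schwarz (or Jensen) inequality gives the stated bound with the square root of E[|η(X) − η̂n(X)|²].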
How to estimate the conditional law of Y ?

Fully parametric modeling.


Estimate the law of (X, Y ) and use the Bayes formula to deduce an estimate
of the conditional law of Y : LDA/QDA, Naive Bayes...

Parametric conditional modeling.


Estimate the conditional law of Y by a parametric law: linear regression,
logistic regression, Feed Forward Neural Networks...

Nonparametric conditional modeling.


Estimate the conditional law of Y by a nonparametric estimator: kernel
methods, nearest neighbors...

20/39
Fully Generative Modeling

If the law of (X, Y) is known, everything can be easy!


Bayes formula
With a slight abuse of notation, if the law of X has a density g with respect to
a reference measure,

P(Y = k|X) = gk(X) P(Y = k) / g(X),

where gk is the density of the distribution of X given {Y = k}.

Generative Modeling
Propose a model for (X, Y ).
Plug the conditional law of Y given X in the Bayes classifier.

Remark: this requires modeling the joint law of (X, Y) rather than only the
conditional law of Y.
Great flexibility in the model design, but it may lead to complex computations.

21/39
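In the two-class setting used in these slides, the formula specializes to

    \[
    \mathbb{P}(Y = 1 \mid X = x)
    = \frac{\pi_1\, g_1(x)}{\pi_1\, g_1(x) + \pi_{-1}\, g_{-1}(x)},
    \qquad \pi_k = \mathbb{P}(Y = k),
    \]

so estimating the class proportions πk and the class-conditional densities gk is enough to plug into the Bayes classifier; this is exactly what Naive Bayes and discriminant analysis do below.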
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

22/39
Naive Bayes

Naive Bayes
Classical algorithm using a crude model for P(X|Y):
+ Feature independence assumption:

P(X|Y) = ∏_{i=1}^d P(X^(i)|Y).

+ Simple featurewise model: binomial if binary, multinomial if finite, and
Gaussian if continuous.

If all features are continuous, the law of X given Y is Gaussian with a diagonal
covariance matrix!

Very simple learning even in very high dimension!

23/39
Gaussian Naive Bayes

+ Feature independence assumption:

P(X|Y) = ∏_{j=1}^d P(X^(j)|Y).

For k ∈ {−1, 1}, P(Y = k) = πk and the conditional density of X^(j) given
{Y = k} is

gk(x^(j)) = (2π σ_{j,k}²)^(−1/2) exp( −(x^(j) − μ_{j,k})² / (2 σ_{j,k}²) ).

The conditional density of X given {Y = k} is then

gk(x) = (det(2πΣk))^(−1/2) exp( −(x − μk)^T Σk^(−1) (x − μk) / 2 ),

where Σk = diag(σ_{1,k}², . . . , σ_{d,k}²) and μk = (μ_{1,k}, . . . , μ_{d,k})^T.

24/39
Gaussian Naive Bayes

In a two-class problem, the optimal classifier is (see linear discriminant
analysis below):

f* : X ↦ 2·1{P(Y = 1|X) > P(Y = −1|X)} − 1.

+ When the parameters are unknown, they may be replaced by their maximum
likelihood estimates. This yields, for k ∈ {−1, 1},

π̂k = (1/n) ∑_{i=1}^n 1{Yi = k},

μ̂k = ( ∑_{i=1}^n 1{Yi = k} )^(−1) ∑_{i=1}^n 1{Yi = k} Xi,

Σ̂k = diag( ( ∑_{i=1}^n 1{Yi = k} )^(−1) ∑_{i=1}^n 1{Yi = k} (Xi − μ̂k)(Xi − μ̂k)^T ).

25/39
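A minimal R sketch of these estimators and of the resulting classifier (simulated two-class data; note that var() uses the unbiased n_k − 1 denominator rather than the maximum-likelihood n_k, which is immaterial for the illustration):

    # Gaussian Naive Bayes "by hand": per-class proportions, featurewise means and
    # variances, then classification via the log-posterior.
    set.seed(7)
    n <- 300
    y <- sample(c(-1, 1), n, replace = TRUE)
    X <- cbind(rnorm(n, mean = ifelse(y == 1, 1, -1), sd = 1),
               rnorm(n, mean = ifelse(y == 1, 2,  0), sd = 2))

    params <- lapply(c(-1, 1), function(k) {
      Xk <- X[y == k, , drop = FALSE]
      list(pi  = mean(y == k),        # estimated class proportion
           mu  = colMeans(Xk),        # featurewise means
           var = apply(Xk, 2, var))   # featurewise variances (diagonal Sigma_k)
    })
    names(params) <- c("-1", "1")

    # log( pi_k * prod_j g_k(x_j) ): independence turns the product into a sum.
    log_post <- function(x, p) log(p$pi) + sum(dnorm(x, p$mu, sqrt(p$var), log = TRUE))
    classify <- function(x) if (log_post(x, params[["1"]]) > log_post(x, params[["-1"]])) 1 else -1

    y_hat <- apply(X, 1, classify)
    mean(y_hat != y)   # training error of the hand-rolled classifier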
Gaussian Naive Bayes

[Slides 26–29: figures illustrating Gaussian Naive Bayes fits and decision boundaries on example data.]

26–29/39
Kernel density estimate based Naive Bayes

[Slide 30: figure illustrating Naive Bayes with kernel density estimates of the featurewise densities.]

30/39
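The same models are available off the shelf through the naivebayes package listed at the end of these slides; a short sketch assuming its usual formula interface (usekernel = TRUE replaces the Gaussian featurewise model by the kernel density estimate of slide 30):

    # Gaussian and kernel-density Naive Bayes via the naivebayes package.
    library(naivebayes)

    set.seed(7)
    n  <- 300
    df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    df$y <- factor(ifelse(df$x1 + 0.5 * df$x2 + rnorm(n) > 0, 1, -1))

    nb_gauss  <- naive_bayes(y ~ x1 + x2, data = df)                    # Gaussian featurewise model
    nb_kernel <- naive_bayes(y ~ x1 + x2, data = df, usekernel = TRUE)  # KDE featurewise model

    table(predicted = predict(nb_gauss, newdata = df[, c("x1", "x2")]), truth = df$y)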
Outline

Introduction to supervised learning

Bayes and Plug-in classifiers

Naive Bayes

Discriminant analysis (linear and quadratic)

31/39
Discriminant Analysis

Discriminant Analysis (Gaussian model)


The conditional densities are modeled as multivariate normal. For each class k,
conditionally on {Y = k},

X ∼ N(μk, Σk).

Discriminant functions:

gk(X) = ln P(X|Y = k) + ln P(Y = k).

In a two-class problem, the optimal classifier is (see exercises):

f* : x ↦ 2·1{g1(x) > g−1(x)} − 1.

QDA (a different Σk in each class) and LDA (Σk = Σ for all k).


Remark: this model can be false, but the methodology remains valid!

32/39
Discriminant Analysis

Estimation
In practice, μk, Σk and πk := P(Y = k) have to be estimated.

+ Estimated proportions: π̂k = nk/n = (1/n) ∑_{i=1}^n 1{Yi = k}.

+ Maximum likelihood estimates μ̂k and Σ̂k (explicit formulas).

The DA classifier then becomes

f̂n(X) = +1 if ĝ1(X) ≥ ĝ−1(X),
         −1 otherwise.

If Σ−1 = Σ1 = Σ, then the decision boundary is an affine hyperplane.

33/39
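Why the boundary is affine when Σ−1 = Σ1 = Σ: in the difference of the discriminant functions the quadratic terms cancel (a short check using the Gaussian densities above):

    \[
    g_1(x) - g_{-1}(x)
    = x^\top \Sigma^{-1}(\mu_1 - \mu_{-1})
    - \tfrac{1}{2}\,\mu_1^\top \Sigma^{-1}\mu_1
    + \tfrac{1}{2}\,\mu_{-1}^\top \Sigma^{-1}\mu_{-1}
    + \ln\frac{\pi_1}{\pi_{-1}},
    \]

which is affine in x, so {x : g1(x) = g−1(x)} is a hyperplane. With class-specific covariances (QDA) the terms −x^T Σk^(−1) x / 2 do not cancel and the boundary is quadratic in x.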
The log-likelihood of the observations is given by

log Pθ(X_{1:n}, Y_{1:n}) = ∑_{i=1}^n log Pθ(Xi, Yi)

= −(nd/2) log(2π) − (n/2) log det(Σ)
  + ( ∑_{i=1}^n 1{Yi = 1} ) log π1 + ( ∑_{i=1}^n 1{Yi = −1} ) log(1 − π1)
  − (1/2) ∑_{i=1}^n 1{Yi = 1} (Xi − μ1)^T Σ^(−1) (Xi − μ1)
  − (1/2) ∑_{i=1}^n 1{Yi = −1} (Xi − μ−1)^T Σ^(−1) (Xi − μ−1).

This yields, for k ∈ {−1, 1},

π̂k = (1/n) ∑_{i=1}^n 1{Yi = k},

μ̂k = ( ∑_{i=1}^n 1{Yi = k} )^(−1) ∑_{i=1}^n 1{Yi = k} Xi,

Σ̂ = (1/n) ∑_{i=1}^n (Xi − μ̂_{Yi})(Xi − μ̂_{Yi})^T.

It remains to plug these estimates into the classification boundary.


34/39
Example: LDA

[Slides 35–36: figures illustrating LDA on example data.]

35–36/39
Example: QDA

[Slides 37–38: figures illustrating QDA on example data.]

37–38/39
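Plots like these can be reproduced with the MASS functions listed on the next slide; here is a minimal sketch (iris restricted to two species, used only as a readily available dataset):

    # LDA and QDA with MASS on a two-class subset of iris (illustration only).
    library(MASS)

    two_class <- droplevels(subset(iris, Species != "setosa"))

    lda_fit <- lda(Species ~ Sepal.Length + Sepal.Width, data = two_class)  # common Sigma
    qda_fit <- qda(Species ~ Sepal.Length + Sepal.Width, data = two_class)  # class-specific Sigma_k

    # Compare the training errors of the two decision rules.
    mean(predict(lda_fit)$class != two_class$Species)
    mean(predict(qda_fit)$class != two_class$Species)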
Packages in R

Function svm in package e1071.

Functions lda and qda in package MASS.

Function naive_bayes in package naivebayes.

39/39
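All three follow the same formula-plus-data calling pattern; a brief sketch with default parameters (iris used only as a placeholder dataset):

    library(e1071)
    library(MASS)
    library(naivebayes)

    svm_fit <- svm(Species ~ ., data = iris)          # support vector machine
    lda_fit <- lda(Species ~ ., data = iris)          # linear discriminant analysis
    nb_fit  <- naive_bayes(Species ~ ., data = iris)  # naive Bayes

    mean(predict(svm_fit, iris) != iris$Species)      # training error of the SVM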
