MLE and MAP Classifier

An introduction to the Maximum Likelihood Estimate (MLE) classifier and the Maximum A Posteriori (MAP) classifier, which is also known as the Bayes classifier.


Classifiers Based on
Maximum Likelihood (ML / MLE)
and
Maximum A Posteriori Probability (MAP)

Objective: Learn the basics of
(i) the ML (or MLE) classifier
(ii) the MAP classifier (also called the Bayes classifier)
Road Map
[Figure: a spectrum of problems, ranging from "a lot is known" (easier problems) to "little is known" (harder problems)]
A Lot is Known: Easier Problem
Know the probability distribution of the categories or classes (both the shape and the parameters of the probability distribution are known)
This never happens in the real world
Do not even need training data
Can design the optimal classifier
Example: a respected fish expert says that salmon's length has distribution N(5,1) and sea bass's length has distribution N(10,4) (units are in inches)
Questions: Math 1.1, 1.2, 1.3, 1.4
Difficult: Shape of Distribution Known, but Parameters Unknown
The shape of the probability distribution is known, but its parameters are NOT known. This happens sometimes.
Have labeled training data (e.g., salmon, bass, salmon, salmon, ...)
Need to estimate the parameters of the probability distribution from the training data
Example: a respected fish expert says salmon's length has distribution N(µ1, σ1²) and sea bass's length has distribution N(µ2, σ2²)
Need to estimate the parameters µ1, σ1², µ2, σ2²
Then can use the methods of MLE or Bayesian decision theory
Question: Math 2.1
More Difficult: No Probability Distribution Known, No Parameters Known
No probability distribution is known (neither shape nor parameters)
Have labeled data (e.g., salmon, bass, salmon, salmon, ...)
The shape of the discriminant function is known
[Figure: lightness vs. length, with a linear discriminant function separating the bass region from the salmon region]
Need to estimate the parameters of the discriminant function (e.g., the parameters (m, c) of the line in the case of a linear discriminant function)
Very Difficult: Nothing Known, Only Class-Labeled Data
Neither the probability distribution nor the discriminant function is known
Happens quite often
All we have is labeled data (i.e., the class labels are known), e.g., salmon, bass, salmon, salmon, ...
Estimate BOTH the shape of the probability distribution (or discriminant function) AND its parameters from the class-labeled data
Most Difficult: Clustering and Unsupervised Learning
Data is not labeled (NO class information is known)
Happens quite often
1. Cluster the data (unsupervised learning)
2. After identifying the classes, compute the probability distribution and parameters of each class to build a representation of each class's data
Q: SORT / EXPLAIN the five different categories of machine learning problems based on their level of difficulty.
*** Skip ***
Course Road Map
1. Bayesian decision theory (rare case)
   Know the probability distribution of the categories
   Do not even need training data
   Can design the optimal classifier
2. ML and Bayesian parameter estimation
   Need to estimate the parameters of the probability distribution
   Need training data
3. Non-parametric methods
   No probability distribution, labeled data
4. Linear discriminant functions and neural nets
   The shape of the discriminant function is known
   Need to estimate the parameters of the discriminant function
5. Unsupervised learning and clustering
   No probability distribution and unlabeled data
*** Skip until the "Cats and Dogs" slide if you know probability basics ***
Notation Change (for consistency with the textbook)

Pr[A]     probability of event A
P(x)      probability mass function of a discrete r.v. x
p(x)      probability density function of a continuous r.v. x
p(x,y)    joint probability density of r.v.'s x and y
p(x|y)    conditional density of x given y
P(x|y)    conditional mass of x given y
More on Probability
For events A and B, we have defined:

Conditional probability: $\Pr(A \mid B) = \dfrac{\Pr(A \cap B)}{\Pr(B)}$

Law of total probability: $\Pr(A) = \sum_{k=1}^{n} \Pr(A \mid B_k)\Pr(B_k)$

Bayes' rule: $\Pr(B_i \mid A) = \dfrac{\Pr(A \mid B_i)\Pr(B_i)}{\sum_{k=1}^{n} \Pr(A \mid B_k)\Pr(B_k)}$

Usually we model with random variables, not events.
Need equivalents of these laws for mass and density functions (we could go from random variables back to events, but it is time consuming).
Conditional Mass Function: Discrete RV
For a discrete RV there is nothing new, because the mass function is really a probability law.
Define the conditional mass function of X given Y = y (y is fixed) by
$$P(x \mid y) = \frac{P(x, y)}{P(y)}$$
This is a probability mass function because
$$\sum_{\forall x} P(x \mid y) = \frac{\sum_{\forall x} P(x, y)}{P(y)} = \frac{P(y)}{P(y)} = 1$$
This is really nothing new because
$$P(x \mid y) = \frac{P(x, y)}{P(y)} = \frac{\Pr[X = x \cap Y = y]}{\Pr[Y = y]} = \Pr[X = x \mid Y = y]$$
Conditional Mass Function: Bayes Rule

The law of total probability:
$$P(x) = \sum_{\forall y} P(x, y) = \sum_{\forall y} P(x \mid y)\,P(y)$$

Bayes' rule:
$$P(y \mid x) = \frac{P(y, x)}{P(x)} = \frac{P(x \mid y)\,P(y)}{\sum_{\forall y} P(x \mid y)\,P(y)}$$
Conditional Density Function: Continuous RV
Does it make sense to talk about the conditional density p(x|y) if Y is a continuous random variable? After all, Pr[Y = y] = 0, so we will never see Y = y in practice.
Measurements have limited accuracy. We can interpret the observation y as an observation in the interval [y-ε, y+ε], and the observation x as an observation in the interval [x-ε, x+ε].
Conditional Density Function: Continuous RV
Let B(x) denote the interval [x-ε, x+ε]. Then
$$\Pr[X \in B(x)] = \int_{x-\varepsilon}^{x+\varepsilon} p(x)\,dx \approx 2\varepsilon\, p(x)$$
Similarly, $\Pr[Y \in B(y)] \approx 2\varepsilon\, p(y)$ and $\Pr[X \in B(x) \cap Y \in B(y)] \approx 4\varepsilon^2\, p(x, y)$.
Thus we should have
$$p(x \mid y) \approx \frac{\Pr[X \in B(x) \mid Y \in B(y)]}{2\varepsilon}$$
which can be simplified to
$$p(x \mid y) \approx \frac{\Pr[X \in B(x) \cap Y \in B(y)]}{2\varepsilon\,\Pr[Y \in B(y)]} \approx \frac{p(x, y)}{p(y)}$$
Conditional Density Function: Continuous RV
Define the conditional density function of X given Y = y (y is fixed) by
$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$
This is a probability density function because
$$\int_{-\infty}^{\infty} p(x \mid y)\,dx = \int_{-\infty}^{\infty} \frac{p(x, y)}{p(y)}\,dx = \frac{\int_{-\infty}^{\infty} p(x, y)\,dx}{p(y)} = \frac{p(y)}{p(y)} = 1$$
The law of total probability:
$$p(x) = \int_{-\infty}^{\infty} p(x, y)\,dy = \int_{-\infty}^{\infty} p(x \mid y)\,p(y)\,dy$$
Conditional Density Function: Bayes Rule
Bayes' rule:
$$p(y \mid x) = \frac{p(y, x)}{p(x)} = \frac{p(x \mid y)\,p(y)}{\int_{-\infty}^{\infty} p(x \mid y)\,p(y)\,dy}$$
Mixed Discrete and Continuous
X discrete, Y continuous. Bayes' rule:
$$P(x \mid y) = \frac{p(y \mid x)\,P(x)}{p(y)}$$
X continuous, Y discrete. Bayes' rule:
$$p(x \mid y) = \frac{P(y \mid x)\,p(x)}{P(y)}$$
Bayesian Decision Theory
Know the probability distribution of the categories
Almost never the case in real life!
Nevertheless useful, since other cases can be reduced to this one after some work
Do not even need training data
Can design the optimal classifier

Q: (Math 1.1) Animal scientists have found that the probabilities of finding small ears in cats and dogs are 0.8 and 0.1, respectively. Suppose an animal is observed with large ears.
(i) What is the probability that the observed animal is a dog?
(ii) To which class will the test animal be classified by the MLE classifier: CAT or DOG?
Question: see the previous slide!

Cats and Dogs
Suppose we have these conditional probability mass functions for cats and dogs:
P(small ears | dog) = 0.1, P(large ears | dog) = 0.9
P(small ears | cat) = 0.8, P(large ears | cat) = 0.2
We observe an animal with large ears. Dog or cat?
likelihood = P(feature | class), called the likelihood of that particular feature in the given class
It makes sense to say dog, because the probability of observing large ears in a dog is much larger than the probability of observing large ears in a cat:
(likelihood) Pr[large ears | dog] = 0.9 > 0.2 = Pr[large ears | cat]
SO ===> classify as "DOG" (solved)
Core idea of the MLE classifier: choose the event of largest likelihood, i.e., the maximum likelihood event (here the events are "being a dog" and "being a cat", and the maximum likelihood event is "being a dog").
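As an illustration, here is a minimal Python sketch of this maximum-likelihood decision; the probability values are the ones given above, and the function and dictionary names are purely illustrative:

likelihoods = {
    "dog": {"small ears": 0.1, "large ears": 0.9},
    "cat": {"small ears": 0.8, "large ears": 0.2},
}

def ml_classify(feature):
    # MLE rule: choose the class whose likelihood P(feature | class) is largest
    return max(likelihoods, key=lambda c: likelihoods[c][feature])

print(ml_classify("large ears"))  # prints "dog", since 0.9 > 0.2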
Example: Fish Sorting
A respected fish expert says that
  salmon's length has distribution N(5,1)
  sea bass's length has distribution N(10,4)
(The variances are 1 and 4 inch², which means the standard deviations are 1 and 2 inches, respectively.)
Recall that if an r.v. is N(µ, σ²), then its probability density function (PDF) is
$$p(l) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(l-\mu)^2}{2\sigma^2}}$$
(don't miss the minus sign in the exponent).
Thus the class-conditional densities (class-conditional PDFs) are
$$p(l \mid salmon) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2 \cdot 1}} \qquad p(l \mid bass) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{2 \cdot 4}}$$
Likelihood Function
The class-conditional densities are
$$p(l \mid salmon) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2 \cdot 1}} \qquad p(l \mid bass) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{2 \cdot 4}}$$
Fix the length l and let the fish class vary. Then we get the likelihood function (it is not a density and not a probability mass function):
$$p(l \mid class) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} & \text{if class = salmon} \\[2ex] \dfrac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{8}} & \text{if class = bass} \end{cases}$$
This is called the class-conditional probability density value, i.e., the likelihood of finding the fixed length l in the class "salmon" or "bass".
Likelihood vs. Class-Conditional Density
[Figure: the two class-conditional densities p(l | class) plotted against length, with the observation l = 7 marked]
Suppose a fish has length 7. How do we classify it?

Q: (Math 1.2) Fish experts have found that the lengths of salmon and bass follow Gaussian (i.e., normal) distributions with means of 5 and 10 inches, respectively, and variances of 1 and 4 inch², respectively. A fish is observed with a length of 7 inches. Explain how (i) the MLE classifier and (ii) the MAP (Bayes) classifier would classify it: salmon or bass? (iii) What will be their classification decisions if (a) salmon and bass are equally likely, and (b) salmon is twice as likely as bass?
ML (Maximum Likelihood) Classifier
We would like to choose salmon if
Pr[length = 7 | salmon] > Pr[length = 7 | bass]
However, since length is a continuous r.v.,
Pr[length = 7 | salmon] = Pr[length = 7 | bass] = 0
Instead, we choose the class which maximizes the likelihood:
$$p(l \mid salmon) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \qquad p(l \mid bass) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{2 \cdot 4}}$$
ML classifier: for an observed l,
$$p(l \mid salmon) \underset{bass}{\overset{salmon}{\gtrless}} p(l \mid bass)$$
In words: if p(l | salmon) > p(l | bass), classify as salmon, else classify as bass.
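A minimal Python sketch of this ML rule for the fish example, assuming the two Gaussian class-conditional densities above (helper names are illustrative):

import math

def normal_pdf(l, mu, sigma):
    # Gaussian density N(mu, sigma^2) evaluated at l
    return math.exp(-(l - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def ml_classify(l):
    p_salmon = normal_pdf(l, mu=5, sigma=1)   # salmon length ~ N(5, 1)
    p_bass = normal_pdf(l, mu=10, sigma=2)    # sea bass length ~ N(10, 4)
    return "salmon" if p_salmon > p_bass else "bass"

print(ml_classify(7))  # p(7|bass) ≈ 0.065 > p(7|salmon) ≈ 0.054, so "bass"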
Decision Boundary
Q: What is a decision boundary? The decision boundary between class i and class j is the set of all points where the two classes have equal likelihood values (MLE classifier), equal posterior probability values (Bayes classifier), or equal discriminant function values (minimum error rate classifier).

[Figure: the two class-conditional densities; lengths below the boundary at 6.70 are classified as salmon, lengths above as sea bass]

Q: (Math 1.3) Find the decision boundary between the salmon and bass classes based on their length, when no prior knowledge is available.
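For Math 1.3, a small numerical sketch (assuming the two Gaussian densities above) that locates the ML boundary by finding where the two likelihoods are equal; scipy's brentq root finder is used only for convenience, and the exact value depends on whether the differing 1/σ normalization factors are kept in the comparison:

import math
from scipy.optimize import brentq

def normal_pdf(l, mu, sigma):
    return math.exp(-(l - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# ML decision boundary: the length where p(l | salmon) = p(l | bass)
diff = lambda l: normal_pdf(l, 5, 1) - normal_pdf(l, 10, 2)
print(round(brentq(diff, 5, 10), 2))  # the root lies between the two class means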
Priors
A prior comes from prior knowledge; no data has been seen yet.
Suppose a fish expert says: in the fall, there are twice as many salmon as sea bass.
Priors for our fish sorting problem:
P(salmon) = 2/3
P(bass) = 1/3
With the addition of the prior to our model, how should we classify a fish of length 7?

Q: (Math 1.4) Fish experts have found that there are twice as many salmon as sea bass. With this prior knowledge, how should the ML classifier and the Bayes classifier classify a fish of length 7.0?
Ans: For the MLE classifier, the "twice as likely" prior has no effect. For the Bayes classifier, see the scanned solution.
How Does the Prior Change the Decision Boundary?
Without priors: classify as salmon below 6.70 and as sea bass above 6.70.
How should this change with the prior P(salmon) = 2/3, P(bass) = 1/3? Does the boundary move to the left or to the right of 6.70?
In the presence of prior probabilities we need Bayes decision theory. (The MLE classifier does not consider prior probabilities.)

Bayes Decision Rule
1. Have the likelihood functions p(length | salmon) and p(length | bass)
2. Have the priors P(salmon) and P(bass)
Question: having observed a fish of a certain length, do we classify it as salmon or bass?
Natural idea:
  salmon if P(salmon | length) > P(bass | length)
  bass if P(bass | length) > P(salmon | length)
P(feature | class) is called the likelihood; P(class) is called the prior probability.
P(class | feature) is called the posterior probability: the probability of being in the class AFTER we have seen the data.

P(salmon | length) and P(bass | length) are called posterior distributions, because the data (length) has been revealed (post data).
How to compute posteriors? It is not obvious. From Bayes' rule,
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}$$
so
$$P(salmon \mid length) = \frac{p(salmon, length)}{p(length)} = \frac{p(length \mid salmon)\,P(salmon)}{p(length)}$$
Similarly,
$$P(bass \mid length) = \frac{p(length \mid bass)\,P(bass)}{p(length)}$$
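A small Python sketch computing these posteriors for the fish example at l = 7, using the priors P(salmon) = 2/3 and P(bass) = 1/3 introduced earlier (helper names are illustrative):

import math

def normal_pdf(l, mu, sigma):
    return math.exp(-(l - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

l = 7.0
likelihood = {"salmon": normal_pdf(l, 5, 1), "bass": normal_pdf(l, 10, 2)}
prior = {"salmon": 2 / 3, "bass": 1 / 3}

# Evidence p(l) from the law of total probability
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posteriors via Bayes' rule: P(c | l) = p(l | c) P(c) / p(l)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)  # the two posteriors sum to 1; here P(salmon | 7) > P(bass | 7)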
Bayes Classifier, Also Called the MAP (Maximum A Posteriori) Classifier
$$P(salmon \mid length) \underset{bass}{\overset{salmon}{\gtrless}} P(bass \mid length)$$
$$\frac{p(length \mid salmon)\,P(salmon)}{p(length)} \underset{bass}{\overset{salmon}{\gtrless}} \frac{p(length \mid bass)\,P(bass)}{p(length)}$$
$$p(length \mid salmon)\,P(salmon) \underset{bass}{\overset{salmon}{\gtrless}} p(length \mid bass)\,P(bass)$$
Back to the Fish Sorting Example
Likelihoods:
$$p(l \mid salmon) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \qquad p(l \mid bass) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{8}}$$
Priors: P(salmon) = 2/3, P(bass) = 1/3
Solve the inequality
$$\frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \cdot \frac{2}{3} \;>\; \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{8}} \cdot \frac{1}{3}$$
[Figure: the new decision boundary moves from 6.70 to 7.18; classify as salmon below it and as sea bass above it]
(Do the calculation to find this decision boundary.)
The new decision boundary makes sense, since we expect to see more salmon.
Q 1.5: Find the decision boundary for the above problem for the Bayes / MAP classifier.
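A worked sketch of that calculation, using the two Gaussian densities and the priors above; it reproduces the boundary of 7.18 marked in the figure. At the boundary the two (likelihood × prior) products are equal:
$$\frac{2}{3}\cdot\frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} = \frac{1}{3}\cdot\frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{8}}
\;\Longleftrightarrow\;
4\, e^{-\frac{(l-5)^2}{2}} = e^{-\frac{(l-10)^2}{8}}$$
Taking logarithms and multiplying by 8:
$$8\ln 4 - 4(l-5)^2 = -(l-10)^2
\;\Longleftrightarrow\;
3l^2 - 20l - 8\ln 4 = 0
\;\Longleftrightarrow\;
l = \frac{20 + \sqrt{400 + 96\ln 4}}{6} \approx 7.18$$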
============== Read => Up to this Point (Skip the Next Slides) =============
Prior P(s) = 2/3, P(b) = 1/3 vs. Prior P(s) = 0.999, P(b) = 0.001
[Figure: with the stronger salmon prior, the decision boundary moves further toward the bass mean, from about 7.1 to about 8.9 on the length axis]
Likelihood vs. Posteriors
[Figure: the likelihoods p(l | salmon) and p(l | bass) and the posteriors P(salmon | l) and P(bass | l) plotted against length]
The likelihood p(l | fish class) is a density with respect to length: the area under each curve is 1.
The posterior P(fish class | l) is a mass function with respect to fish class, so for each l, P(salmon | l) + P(bass | l) = 1.
More on the Posterior
$$\underbrace{P(c \mid l)}_{\text{posterior (our goal)}} = \frac{\overbrace{P(l \mid c)}^{\text{likelihood (given)}}\;\overbrace{P(c)}^{\text{prior (given)}}}{\underbrace{P(l)}_{\text{normalizing factor}}}$$
We often do not even need the normalizing factor P(l) for classification, since it does not depend on the class c. If we do need it, from the law of total probability:
$$P(l) = p(l \mid salmon)\,P(salmon) + p(l \mid bass)\,P(bass)$$
Notice that this formula consists of likelihoods and priors, which are given.
More on the Posterior
$$P(c \mid l) = \frac{P(l \mid c)\,P(c)}{P(l)}$$
Think of the class c as the cause and the length l as the effect.
If cause c is present, it is easy to determine the probability of effect l with the likelihood P(l | c).
Usually we observe the effect l without knowing the cause c. It is hard to determine the cause c, because several causes could produce the same effect l.
Bayes' rule makes it easy to determine the posterior P(c | l), if we know the likelihood P(l | c) and the prior P(c).
More on Priors
A prior comes from prior knowledge; no data has been seen yet.
If there is a reliable source of prior knowledge, it should be used.
Some problems cannot even be solved reliably without a good prior.
However, the prior alone is not enough; we still need the likelihood.
P(salmon) = 2/3, P(sea bass) = 1/3: if I don't let you see the data but ask you to guess, will you choose salmon or sea bass?
More on the MAP Classifier
$$P(c \mid l) = \frac{P(l \mid c)\,P(c)}{P(l)}$$
We do not care about P(l) when maximizing P(c | l):
$$P(c \mid l) \propto P(l \mid c)\,P(c)$$
If P(salmon) = P(bass) (uniform prior), the MAP classifier becomes the ML classifier: $P(c \mid l) \propto P(l \mid c)$.
If for some observation l, P(l | salmon) = P(l | bass), then this observation is uninformative and the decision is based solely on the prior: $P(c \mid l) \propto P(c)$.
Justification for the MAP Classifier
Let's compute the probability of error for the MAP rule
$$P(salmon \mid l) \underset{bass}{\overset{salmon}{\gtrless}} P(bass \mid l)$$
For any particular l, the probability of error is
$$\Pr[error \mid l] = \begin{cases} P(bass \mid l) & \text{if we decide salmon} \\ P(salmon \mid l) & \text{if we decide bass} \end{cases}$$
The MAP rule picks the class with the larger posterior, so it leaves the smaller of the two as the error probability. Thus the MAP classifier is optimal for each individual l!
Justification for the MAP Classifier
We want to minimize the error not just for one l; we really want to minimize the average error over all l:
$$\Pr[error] = \int_{-\infty}^{\infty} p(error, l)\,dl = \int_{-\infty}^{\infty} \Pr[error \mid l]\,p(l)\,dl$$
If Pr[error | l] is as small as possible for every l, the integral is as small as possible.
But the MAP decision rule makes Pr[error | l] as small as possible.
Thus the MAP classifier minimizes the probability of error!

Today
Bayesian decision theory
Multiple classes
General loss functions
Multivariate normal random variables
Classifiers
Discriminant functions
More General Case
Let's generalize a little bit:
  Have more than one feature: $x = [x_1, x_2, \ldots, x_d]$
  Have more than 2 classes: $\{c_1, c_2, \ldots, c_m\}$
As before, for each j we have:
  $p(x \mid c_j)$, the likelihood of observation x given that the true class is $c_j$
  $P(c_j)$, the prior probability of class $c_j$
  $P(c_j \mid x)$, the posterior probability of class $c_j$ given that we observed data x
The evidence, or probability density of the data, is
$$p(x) = \sum_{j=1}^{m} p(x \mid c_j)\,P(c_j)$$
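A generic sketch of this m-class MAP decision in Python; the likelihood callables and class names are placeholders to be supplied by whatever class-conditional model is used:

def map_classify(x, likelihood_fns, priors):
    # MAP rule: pick the class maximizing p(x | c) P(c).
    # The evidence p(x) is the same for every class, so it can be dropped.
    return max(priors, key=lambda c: likelihood_fns[c](x) * priors[c])

# Example usage with the 1-D fish model from the earlier sketches:
# map_classify(7.0,
#              {"salmon": lambda l: normal_pdf(l, 5, 1),
#               "bass":   lambda l: normal_pdf(l, 10, 2)},
#              {"salmon": 2/3, "bass": 1/3})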
Minimum Error Rate Classification
We want to minimize the average probability of error:
$$\Pr[error] = \int p(error, x)\,dx = \int \Pr[error \mid x]\,p(x)\,dx$$
and we need to make $\Pr[error \mid x]$ as small as possible for every x.
$$\Pr[error \mid x] = 1 - P(c_i \mid x) \quad \text{if we decide class } c_i$$
Pr[error | x] is minimized with the MAP classifier: decide on class $c_i$ if
$$P(c_i \mid x) > P(c_j \mid x) \quad \forall j \neq i$$
[Figure: the posteriors P(c1|x), P(c2|x), P(c3|x) and the corresponding error curves 1 - P(ci|x)]
The MAP classifier is optimal if we want to minimize the probability of error.
General Bayesian Decision Theory
In close cases we may want to refuse to make a decision (let a human expert handle the tough case), so we allow actions $\{\alpha_1, \alpha_2, \ldots, \alpha_k\}$.
Suppose some mistakes are more costly than others (classifying a benign tumor as cancer is not as bad as classifying cancer as a benign tumor).
Allow a loss function $\lambda(\alpha_i \mid c_j)$ describing the loss incurred when taking action $\alpha_i$ when the true class is $c_j$.
Conditional Risk
Suppose we observe x and wish to take action $\alpha_i$.
If the true class is $c_j$, by definition we incur the loss $\lambda(\alpha_i \mid c_j)$.
The probability that the true class is $c_j$ after observing x is $P(c_j \mid x)$.
The expected loss associated with taking action $\alpha_i$ is called the conditional risk, and it is
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x)$$
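A small Python sketch of this computation; the loss table and posterior values below are illustrative placeholders only:

def conditional_risk(action, loss, posterior):
    # R(action | x) = sum over classes of loss(action | c_j) * P(c_j | x)
    return sum(loss[action][c] * posterior[c] for c in posterior)

def min_risk_action(actions, loss, posterior):
    # Bayes decision: take the action with the smallest conditional risk
    return min(actions, key=lambda a: conditional_risk(a, loss, posterior))

# Illustrative numbers: two actions ("say salmon", "say bass"), two classes
loss = {"say_salmon": {"salmon": 0, "bass": 2},
        "say_bass": {"salmon": 1, "bass": 0}}
posterior = {"salmon": 0.6, "bass": 0.4}
print(min_risk_action(["say_salmon", "say_bass"], loss, posterior))  # "say_bass" (risk 0.6 vs 0.8)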
Conditional Risk
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x)$$
Reading the formula: $R(\alpha_i \mid x)$ is the penalty for taking action $\alpha_i$ if we observe x; the sum is over disjoint events (the different classes); $P(c_j \mid x)$ is the probability of class $c_j$ given the observation x; and each term $\lambda(\alpha_i \mid c_j)\,P(c_j \mid x)$ is the part of the overall penalty that comes from the event that the true class is $c_j$.
Example: Zero-One Loss Function
Action $\alpha_i$ is the decision that the true class is $c_i$:
$$\lambda(\alpha_i \mid c_j) = \begin{cases} 0 & \text{if } i = j \text{ (no mistake)} \\ 1 & \text{otherwise (mistake)} \end{cases}$$
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x) = \sum_{j \neq i} P(c_j \mid x) = 1 - P(c_i \mid x) = \Pr[\text{error if we decide } c_i]$$
Thus the MAP classifier, which decides $c_i$ when $P(c_i \mid x) > P(c_j \mid x)\ \forall j \neq i$, minimizes $R(\alpha_i \mid x)$.
The MAP classifier is the Bayes decision rule under the zero-one loss function.
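A quick Python check of this equivalence with illustrative posterior values (the conditional-risk helper is re-defined so the snippet stays self-contained):

# Under zero-one loss, minimizing the conditional risk = maximizing the posterior
posterior = {"c1": 0.2, "c2": 0.5, "c3": 0.3}   # illustrative numbers only
classes = list(posterior)
zero_one_loss = {a: {c: 0 if a == c else 1 for c in classes} for a in classes}

def conditional_risk(action, loss, post):
    return sum(loss[action][c] * post[c] for c in post)

min_risk = min(classes, key=lambda a: conditional_risk(a, zero_one_loss, posterior))
map_choice = max(classes, key=posterior.get)
print(min_risk, map_choice)  # both are "c2", the class with the largest posterior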
Overall Risk
A decision rule is a function α(x) which, for every x, specifies an action out of $\{\alpha_1, \alpha_2, \ldots, \alpha_k\}$.
The average risk for α(x) is
$$R(\alpha) = \int R(\alpha(x) \mid x)\,p(x)\,dx$$
and we need to make $R(\alpha(x) \mid x)$ as small as possible for every x.
The Bayes decision rule α(x) chooses, for every x, the action which minimizes the conditional risk
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x)$$
The Bayes decision rule α(x) is optimal, i.e., it gives the minimum possible overall risk $R^*$.
Bayes Risk: Example
Salmon is tastier and more expensive than sea bass:
$\lambda_{sb} = \lambda(salmon \mid bass) = 2$  (classify a bass as salmon)
$\lambda_{bs} = \lambda(bass \mid salmon) = 1$  (classify a salmon as bass)
$\lambda_{ss} = \lambda_{bb} = 0$  (no mistake, no loss)
Likelihoods:
$$p(l \mid salmon) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \qquad p(l \mid bass) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{2 \cdot 4}}$$
Priors: P(salmon) = P(bass)
Risk:
$$R(\alpha \mid l) = \sum_{j=1}^{m} \lambda(\alpha \mid c_j)\,P(c_j \mid l) = \lambda_{\alpha s}\,P(s \mid l) + \lambda_{\alpha b}\,P(b \mid l)$$
$$R(salmon \mid l) = \lambda_{ss}P(s \mid l) + \lambda_{sb}P(b \mid l) = \lambda_{sb}P(b \mid l)$$
$$R(bass \mid l) = \lambda_{bs}P(s \mid l) + \lambda_{bb}P(b \mid l) = \lambda_{bs}P(s \mid l)$$
Bayes Risk: Example
$$R(salmon \mid l) = \lambda_{sb}P(b \mid l) \qquad R(bass \mid l) = \lambda_{bs}P(s \mid l)$$
Bayes decision rule (optimal for our loss function): decide salmon if
$$\lambda_{sb}P(b \mid l) < \lambda_{bs}P(s \mid l)$$
so we need to solve
$$\frac{P(b \mid l)}{P(s \mid l)} < \frac{\lambda_{bs}}{\lambda_{sb}}$$
Or, equivalently, since the priors are equal,
$$\frac{P(b \mid l)}{P(s \mid l)} = \frac{p(l \mid b)\,P(b)\,/\,p(l)}{p(l \mid s)\,P(s)\,/\,p(l)} = \frac{p(l \mid b)}{p(l \mid s)} < \frac{\lambda_{bs}}{\lambda_{sb}}$$
Bayes Risk: Example
Need to solve
$$\frac{p(l \mid b)}{p(l \mid s)} < \frac{\lambda_{bs}}{\lambda_{sb}}$$
Substituting the likelihoods and losses:
$$\frac{2 \cdot \frac{1}{2\sqrt{2\pi}}\exp\!\left(-\frac{(l-10)^2}{8}\right)}{1 \cdot \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{(l-5)^2}{2}\right)} < 1
\;\Longleftrightarrow\;
\ln\frac{\exp\!\left(-\frac{(l-10)^2}{8}\right)}{\exp\!\left(-\frac{(l-5)^2}{2}\right)} < \ln 1
\;\Longleftrightarrow\;
-\frac{(l-10)^2}{8} + \frac{(l-5)^2}{2} < 0
\;\Longleftrightarrow\;
3l^2 - 20l < 0
\;\Longleftrightarrow\;
l < \tfrac{20}{3} \approx 6.67$$
[Figure: the new decision boundary at 6.67, slightly to the left of the previous boundary at 6.70; classify as salmon below it and as sea bass above it]
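A minimal Python sketch of this risk comparison (it reuses the Gaussian density helper from the earlier sketches; with equal priors the posteriors are proportional to the likelihoods):

import math

def normal_pdf(l, mu, sigma):
    return math.exp(-(l - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def min_risk_decision(l, loss_sb=2.0, loss_bs=1.0):
    # With equal priors, P(s | l) and P(b | l) are proportional to p(l | s) and p(l | b)
    p_s = normal_pdf(l, 5, 1)
    p_b = normal_pdf(l, 10, 2)
    risk_salmon = loss_sb * p_b   # R(salmon | l) is proportional to lambda_sb * P(b | l)
    risk_bass = loss_bs * p_s     # R(bass | l)   is proportional to lambda_bs * P(s | l)
    return "salmon" if risk_salmon < risk_bass else "bass"

print(min_risk_decision(6.5), min_risk_decision(7.0))  # salmon, bass: boundary near 6.67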
Likelihood Ratio Rule
In the 2-category case, use the likelihood ratio rule:
$$\frac{P(x \mid c_1)}{P(x \mid c_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(c_2)}{P(c_1)}$$
The left side is the likelihood ratio; the right side is a fixed number independent of x.
If the above inequality holds, decide $c_1$; otherwise decide $c_2$.
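A short sketch of this rule in Python; the loss entries, priors, and likelihood values below are placeholders for illustration:

def likelihood_ratio_decide(px_c1, px_c2, prior, loss):
    # loss[i][j] = loss of deciding class i when the true class is j
    threshold = (loss[1][2] - loss[2][2]) / (loss[2][1] - loss[1][1]) * (prior[2] / prior[1])
    return 1 if px_c1 / px_c2 > threshold else 2

# With zero-one loss and equal priors the threshold is 1, so the rule reduces to ML
loss = {1: {1: 0, 2: 1}, 2: {1: 1, 2: 0}}
prior = {1: 0.5, 2: 0.5}
print(likelihood_ratio_decide(0.3, 0.2, prior, loss))  # 0.3/0.2 > 1, so decide class 1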
Discriminant Functions
All of these decision rules have the same structure: at observation x, choose class $c_i$ such that
$$g_i(x) > g_j(x) \quad \forall j \neq i$$
where $g_i(x)$ is a discriminant function.
ML decision rule: $g_i(x) = P(x \mid c_i)$
MAP decision rule: $g_i(x) = P(c_i \mid x)$
Bayes decision rule: $g_i(x) = -R(c_i \mid x)$
Discriminant Functions
The classifier can be viewed as a network which computes m discriminant functions and selects the category corresponding to the largest discriminant.
[Figure: a network with features x1, x2, ..., xd feeding discriminant functions g1(x), g2(x), ..., gm(x), followed by a unit that selects the class giving the maximum]
Each gi(x) can be replaced with f(gi(x)) for any monotonically increasing function f; the results will be unchanged.
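For example, since the logarithm is monotonically increasing, the MAP discriminant $g_i(x) = P(c_i \mid x)$ is often replaced by the equivalent form (a standard transformation, sketched here)
$$g_i(x) = \ln p(x \mid c_i) + \ln P(c_i)$$
obtained by taking the log of $p(x \mid c_i)\,P(c_i)/p(x)$ and dropping the class-independent term $\ln p(x)$, which does not affect which class attains the maximum.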
Decision Regions
Discriminant functions split the feature vector space X into decision regions.
[Figure: a 2-D feature space partitioned into regions labeled c1, c2, c3 (a class may own several disconnected regions); inside the region labeled c2, g2(x) = max_i {g_i(x)}]
Important Points
If we know the probability distributions for the classes, we can design the optimal classifier.
The definition of "optimal" depends on the chosen loss function.
Under the minimum error rate (zero-one) loss function:
  No prior: the ML classifier is optimal
  With a prior: the MAP classifier is optimal
Under a more general loss function:
  The general Bayes classifier is optimal
