MLE and MAP Classifier
A Lot is Known: Easier Problem
Know the probability distribution of the categories or classes (both the shape and the parameters of the probability distribution are known)
Never happens in the real world
Do not even need training data
Can design the optimal classifier
Example
A respected fish expert says that salmon's length has distribution N(5,1) and sea bass's length has distribution N(10,4) (units are inches; the second parameter is the variance).
Question: Math 1.1, 1.2, 1.3, 1.4
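A minimal Python sketch (not from the slides) of this easiest case, assuming scipy is available: with the class-conditional densities fully known, classifying a length is just comparing two likelihoods.

```python
# Sketch: ML classification when the class-conditional densities are fully known.
# Salmon length ~ N(5, 1) and sea bass length ~ N(10, 4) (variance), as in the example.
from scipy.stats import norm

salmon = norm(loc=5, scale=1)   # N(5, 1): sigma = 1
bass   = norm(loc=10, scale=2)  # N(10, 4): sigma^2 = 4, so sigma = 2

def classify_ml(length):
    """Pick the class whose likelihood p(length | class) is larger."""
    return "salmon" if salmon.pdf(length) > bass.pdf(length) else "bass"

print(classify_ml(6.0))   # -> salmon
print(classify_ml(9.0))   # -> bass
```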
Difficult: Shape of Distribution Known, but Parameters Unknown
The shape of the probability distribution is known, but the parameters of the distribution are NOT known
Happens sometimes
Labeled training data (figure: fish images labeled salmon, bass, salmon, salmon)
Example
A respected fish expert says salmon's length has distribution N(µ1, σ1²) and sea bass's length has distribution N(µ2, σ2²), but the parameters are unknown.
(figure: salmon and bass samples in lightness–length feature space, separated by a linear discriminant function)
Need to estimate the parameters of the discriminant function (e.g., the parameters (m, c) of the line in the case of a linear discriminant function)
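A hedged sketch of what parameter estimation looks like in this case: the standard maximum likelihood estimates (sample mean and variance) computed per class from labeled lengths; the toy measurements below are made up for illustration.

```python
# Sketch: estimate Gaussian parameters (mu, sigma) per class by maximum likelihood.
# The toy lengths below are illustrative, not real data.
import numpy as np

train = {
    "salmon": np.array([4.1, 5.3, 4.8, 5.6, 4.9]),
    "bass":   np.array([9.2, 11.0, 10.4, 8.7, 10.9]),
}

params = {}
for label, lengths in train.items():
    mu = lengths.mean()             # MLE of the mean
    sigma = lengths.std(ddof=0)     # MLE of the std (divide by N, not N-1)
    params[label] = (mu, sigma)
    print(label, mu, sigma)
```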
Very Difficult: Nothing Known, Only Class-Labeled Data Available
Neither the probability distribution nor the discriminant function is known
Happens quite often
All we have is labeled data (i.e., the class labels are known)
Q: Sort / explain the five different categories of machine learning problems based on their level of difficulty.
*** Skip ***
Course Road Map
1. Bayesian Decision Theory (rare case)
Know the probability distribution of the categories
Bayes' rule:
$$\Pr(B_i \mid A) = \frac{\Pr(A \mid B_i)\,\Pr(B_i)}{\sum_{k=1}^{n} \Pr(A \mid B_k)\,\Pr(B_k)}$$
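For concreteness, a small worked instance of the rule with illustrative numbers (not from the slides): take n = 2, P(B1) = 2/3, P(B2) = 1/3, Pr(A | B1) = 0.1, Pr(A | B2) = 0.4. Then
$$\Pr(B_1 \mid A) = \frac{0.1 \cdot \tfrac{2}{3}}{0.1 \cdot \tfrac{2}{3} + 0.4 \cdot \tfrac{1}{3}} = \frac{1/15}{3/15} = \frac{1}{3}.$$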
Thus we should have
$$p(x \mid y) \approx \frac{\Pr[X \in B(x) \mid Y \in B(y)]}{2\varepsilon}$$
which can be simplified to
$$p(x \mid y) \approx \frac{\Pr[X \in B(x),\, Y \in B(y)]}{2\varepsilon\,\Pr[Y \in B(y)]} \approx \frac{p(x, y)}{p(y)}$$
Conditional Density Function: Continuous RV
Define the conditional density function of X given Y = y by
$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$
(y is fixed)
X continuous, Y discrete
Bayes rule:
$$p(x \mid y) = \frac{P(y \mid x)\,p(x)}{P(y)}$$
Bayesian Decision Theory
Question: See Previous Page !
(figure: likelihood p(l | class) plotted against length)
Decision Boundary
The decision boundary between class i and class j is the set of all points where both classes have an equal likelihood value (ML classifier), an equal posterior probability value (Bayes/MAP classifier), or an equal discriminant function value (minimum error rate classifier).
(figure: decision boundary at length 6.70)
Q. (Math 1.3) Find the Decision Boundary between the Salmon and Bass classes based on their length, when
no prior knowledge is available.
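One possible way to attack this numerically (a sketch, not the official solution): search for the length where the two likelihood curves cross. Note that the answer depends on whether the normalizing constants 1/(σ√(2π)) are included; this sketch includes them.

```python
# Sketch: locate the ML decision boundary p(l|salmon) = p(l|bass) numerically.
# Uses the densities from the example: salmon ~ N(5,1), bass ~ N(10,4).
from scipy.optimize import brentq
from scipy.stats import norm

salmon = norm(5, 1)
bass = norm(10, 2)      # variance 4 -> std 2

diff = lambda l: salmon.pdf(l) - bass.pdf(l)
boundary = brentq(diff, 5, 10)   # the likelihoods cross between the two means
print(round(boundary, 2))
```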
Priors
The prior comes from prior knowledge; no data has been seen yet
Suppose a fish expert says: in the fall, there
are twice as many salmon as sea bass
Prior for our fish sorting problem
P(salmon) = 2/3
P(bass) = 1/3
In the presence of prior probabilities, we need Bayes decision theory (the ML classifier does not take the prior into account).
Similarly:
$$P(\text{bass} \mid \text{length}) = \frac{p(\text{length} \mid \text{bass})\,P(\text{bass})}{p(\text{length})}$$
Bayes Classifier, also Called
MAP (maximum a posteriori) classifier
$$P(\text{salmon} \mid \text{length}) \;\underset{\text{bass}}{\overset{\text{salmon}}{\gtrless}}\; P(\text{bass} \mid \text{length})$$
$$\frac{p(\text{length} \mid \text{salmon})\,P(\text{salmon})}{p(\text{length})} \;\underset{\text{bass}}{\overset{\text{salmon}}{\gtrless}}\; \frac{p(\text{length} \mid \text{bass})\,P(\text{bass})}{p(\text{length})}$$
$$p(\text{length} \mid \text{salmon})\,P(\text{salmon}) \;\underset{\text{bass}}{\overset{\text{salmon}}{\gtrless}}\; p(\text{length} \mid \text{bass})\,P(\text{bass})$$
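A minimal sketch of the MAP rule in code (illustrative, assuming scipy), using the priors P(salmon) = 2/3 and P(bass) = 1/3 from the earlier slide; the evidence p(length) cancels and is not computed.

```python
# Sketch: MAP (Bayes) decision rule for the fish example.
from scipy.stats import norm

likelihood = {"salmon": norm(5, 1), "bass": norm(10, 2)}
prior = {"salmon": 2/3, "bass": 1/3}

def classify_map(length):
    """Pick the class maximizing p(length|class) * P(class); p(length) cancels."""
    return max(prior, key=lambda c: likelihood[c].pdf(length) * prior[c])

print(classify_map(6.5))
print(classify_map(8.0))
```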
Back to Fish Sorting Example
likelihoods:
$$p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}}, \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{8}}$$
(figure: p(l | salmon) and p(l | bass) plotted against length; each is a density with respect to length, so the area under each curve is 1)
More General Case
$$p(x) = \sum_{j=1}^{m} p(x \mid c_j)\, P(c_j)$$
Minimum Error Rate Classification
Want to minimize average probability of error
$$\Pr[\text{error}] = \int p(\text{error}, x)\,dx = \int \Pr[\text{error} \mid x]\,p(x)\,dx$$
(we need to make $\Pr[\text{error} \mid x]$ as small as possible for every x)
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\, P(c_j \mid x)$$
where $\lambda(\alpha_i \mid c_j)$ is the penalty for taking action $\alpha_i$ if we observe x whose true class is $c_j$; each term $\lambda(\alpha_i \mid c_j)\,P(c_j \mid x)$ is part of the overall penalty.
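A short sketch of the conditional risk computation; the loss matrix and posterior values are made-up illustrative numbers.

```python
# Sketch: conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | c_j) * P(c_j | x).
import numpy as np

# loss[i, j] = lambda(alpha_i | c_j): penalty for choosing class i when the truth is j
loss = np.array([[0.0, 2.0],    # deciding "salmon": costly if it is really bass
                 [1.0, 0.0]])   # deciding "bass"  : cheaper mistake
posterior = np.array([0.7, 0.3])   # P(salmon|x), P(bass|x) for some observed x

risk = loss @ posterior             # R(alpha_i | x) for each action i
decision = risk.argmin()            # Bayes decision: take the minimum-risk action
print(risk, ["salmon", "bass"][decision])
```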
Example: Zero-One loss function
action $\alpha_i$ is the decision that the true class is $c_i$
$$\lambda(\alpha_i \mid c_j) = \begin{cases} 0 & \text{if } i = j \text{ (no mistake)} \\ 1 & \text{otherwise (mistake)} \end{cases}$$
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\, P(c_j \mid x) = \sum_{j \neq i} P(c_j \mid x) = 1 - P(c_i \mid x)$$
Decide salmon when
$$\frac{\exp\!\left(-\frac{(l-10)^2}{8}\right)}{\exp\!\left(-\frac{(l-5)^2}{2}\right)} < 1 \;\Leftrightarrow\; \ln\frac{\exp\!\left(-\frac{(l-10)^2}{8}\right)}{\exp\!\left(-\frac{(l-5)^2}{2}\right)} < \ln(1) \;\Leftrightarrow\; -\frac{(l-10)^2}{8} + \frac{(l-5)^2}{2} < 0 \;\Leftrightarrow\; 3l^2 - 20l < 0 \;\Leftrightarrow\; l < 6.6667$$
New decision boundary: salmon for lengths below 6.67, sea bass above (figure: the previous boundary was at 6.70).
Likelihood Ratio Rule
In 2 category case, use likelihood ratio rule
$$\frac{P(x \mid c_1)}{P(x \mid c_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(c_2)}{P(c_1)}$$
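A sketch of the two-category likelihood ratio rule in code; the loss values are illustrative assumptions, the densities and priors are the ones from the fish example.

```python
# Sketch: decide c1 if p(x|c1)/p(x|c2) > [(lam12 - lam22)/(lam21 - lam11)] * P(c2)/P(c1).
from scipy.stats import norm

p1, p2 = norm(5, 1), norm(10, 2)     # p(x|c1) = salmon, p(x|c2) = bass
P1, P2 = 2/3, 1/3                    # priors from the slides
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0   # illustrative losses

def decide(x):
    threshold = (lam12 - lam22) / (lam21 - lam11) * (P2 / P1)
    return "c1 (salmon)" if p1.pdf(x) / p2.pdf(x) > threshold else "c2 (bass)"

print(decide(7.0))
```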
Discriminant Functions
All decision rules have the same structure:
at observation x, choose class $c_i$ such that
$$g_i(x) > g_j(x) \quad \forall j \neq i$$
where $g_i(x)$ is the discriminant function for class $c_i$
ML decision rule: $g_i(x) = P(x \mid c_i)$
(figure: features $x_1, \dots, x_d$ feed the discriminant functions $g_1(x), g_2(x), \dots, g_m(x)$)
$g_i(x)$ can be replaced with $f(g_i(x))$ for any monotonically increasing function $f$; the resulting decisions are unchanged
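A small sketch illustrating the monotone-transform point: using log g_i(x) instead of g_i(x) leaves every decision unchanged (and is numerically more stable).

```python
# Sketch: the same decisions come out whether we use g_i(x) = P(x|c_i)P(c_i)
# or its logarithm, since log is monotonically increasing.
import numpy as np
from scipy.stats import norm

classes = ["salmon", "bass"]
likelihood = {"salmon": norm(5, 1), "bass": norm(10, 2)}
prior = {"salmon": 2/3, "bass": 1/3}

def g(c, x):        # discriminant: posterior up to the p(x) factor
    return likelihood[c].pdf(x) * prior[c]

def g_log(c, x):    # monotone transform of g: same argmax, better numerics
    return likelihood[c].logpdf(x) + np.log(prior[c])

for x in (4.0, 7.0, 12.0):
    assert max(classes, key=lambda c: g(c, x)) == max(classes, key=lambda c: g_log(c, x))
print("identical decisions under the log transform")
```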
Decision Regions
Discriminant functions split the feature
vector space X into decision regions
(figure: decision regions labeled $c_1$, $c_2$, $c_3$; inside the region assigned to $c_2$, $g_2(x) = \max_i g_i(x)$; the regions for $c_1$ and $c_3$ each appear as two disconnected pieces)
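A sketch of how decision regions arise in practice: evaluate the discriminants on a grid and assign each point to the argmax class; wherever the winning label changes there is a region boundary. The exact boundary printed depends on the priors and normalization assumed.

```python
# Sketch: 1-D decision regions over a grid of lengths (illustrative version of the figure).
import numpy as np
from scipy.stats import norm

likelihood = {"salmon": norm(5, 1), "bass": norm(10, 2)}
prior = {"salmon": 2/3, "bass": 1/3}

grid = np.linspace(0, 15, 61)
labels = [max(likelihood, key=lambda c: likelihood[c].pdf(l) * prior[c]) for l in grid]

# print where the label changes, i.e., the region boundaries on this grid
for a, b, l in zip(labels, labels[1:], grid[1:]):
    if a != b:
        print(f"boundary near length {l:.2f}: {a} -> {b}")
```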
Important Points
If we know probability distributions for the
classes, we can design the optimal
classifier
Definition of “optimal” depends on the
chosen loss function
Under the minimum error rate (zero-one) loss function:
No prior: the ML classifier is optimal
With a prior: the MAP classifier is optimal
Under a more general loss function:
The general Bayes classifier is optimal