
Machine Learning

Generative Learning Algorithms

Mohamed Farah

ISAMM

2024-2025



Outline

1 Introduction to Generative Learning Algorithms

2 Gaussian Discriminant Analysis (GDA)

3 Naive Bayes



Introduction to Generative Learning Algorithms



Generative vs. Discriminative Learning I

Discriminative Algorithms: Model p(y|x; θ) directly.
Focus on finding a decision boundary to separate classes.
Examples: Logistic Regression, Perceptron.



Generative vs. Discriminative Learning II

Generative Algorithms: Model p(x|y) and p(y), then derive the posterior distribution on y given x using Bayes' rule:

p(y|x) = \frac{p(x|y)\, p(y)}{p(x)}.



Generative vs. Discriminative Learning III

Focus on modeling the distribution of each class, p(x|y), as well as p(y) (called the class priors).
In order to make a prediction, choose

y^* = \arg\max_y p(y|x) = \arg\max_y \frac{p(x|y)\, p(y)}{p(x)} = \arg\max_y p(x|y)\, p(y).

We do not actually need to calculate the denominator p(x).
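As a minimal illustration of this decision rule, the sketch below scores each class by p(x|y = k) p(y = k) and returns the argmax; the names predict_generative, class_likelihood, and priors are illustrative placeholders, not from the slides.

```python
import numpy as np

def predict_generative(x, class_likelihood, priors):
    """Pick the class maximizing p(x|y=k) * p(y=k); the evidence p(x) is not needed."""
    # class_likelihood(x, k) is assumed to return p(x | y = k) for class k.
    scores = [class_likelihood(x, k) * priors[k] for k in range(len(priors))]
    return int(np.argmax(scores))
```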



Gaussian Discriminant Analysis (GDA)



Multivariate Normal Distribution I
Also called the multivariate Gaussian distribution.
Density function:

p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

Mean:

E[X] = \int_x x \, p(x; \mu, \Sigma) \, dx = \mu, \qquad \mu \in \mathbb{R}^d

Covariance of a vector-valued random variable X:

\mathrm{Cov}(X) = E[(X - E[X])(X - E[X])^T] = E[X X^T] - (E[X])(E[X])^T = \Sigma

\Sigma \in \mathbb{R}^{d \times d}, where \Sigma \succeq 0 is symmetric and positive semi-definite.
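The density formula above can be checked with a few lines of NumPy. This is only a sketch of the formula, not code from the slides, and mvn_pdf is an illustrative name.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x; mu, Sigma) for x, mu in R^d."""
    d = mu.shape[0]
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return float(np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const)

# Standard bivariate Gaussian evaluated at its mean: 1 / (2*pi), about 0.159.
print(mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```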
Multivariate Normal Distribution II

Examples of the density of a Gaussian distribution (figure omitted): three densities with \mu = (0, 0)^T and \Sigma = I, \Sigma = 0.6I, and \Sigma = 2I.

As Σ becomes larger, the Gaussian becomes more “spread out,” and as it becomes smaller, the distribution becomes more “compressed.”



Multivariate Normal Distribution III

Examples of the density of a Gaussian distribution (figure omitted): three densities with \mu = (0, 0)^T and \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}.

As we increase the off-diagonal entry in Σ, the density becomes more “compressed” towards the 45° line (given by x1 = x2).



Multivariate Normal Distribution IV
Contour plots (figure omitted): the contours of the same three densities as on the previous slide (\mu = (0, 0)^T with off-diagonal entries 0, 0.5 and 0.8).


Gaussian Discriminant Analysis (GDA) I
We have a classification problem.
Start with binary classification.
The input features x are continuous-valued random variables.
We assume p(x|y) is a multivariate normal distribution.
Model:

y \sim \mathrm{Bernoulli}(\phi)
x | y = 0 \sim \mathcal{N}(\mu_0, \Sigma)
x | y = 1 \sim \mathcal{N}(\mu_1, \Sigma)

Parameters: \phi, \mu_0, \mu_1 and \Sigma.
Distributions:

p(x | y = 0) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right),

p(x | y = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right).
Deriving the GDA Model
Goal: Estimate the parameters \phi, \mu_0, \mu_1, \Sigma using maximum likelihood.
Likelihood of the data D:

L(D; \phi, \mu_0, \mu_1, \Sigma) = \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma)

Decompose into class-conditional probabilities:

p(x^{(i)}, y^{(i)}) = p(x^{(i)} | y^{(i)}) \, p(y^{(i)})

For y^{(i)} = 0:

p(x^{(i)} | y^{(i)} = 0) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_0)^T \Sigma^{-1} (x^{(i)} - \mu_0) \right)

For y^{(i)} = 1:

p(x^{(i)} | y^{(i)} = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_1)^T \Sigma^{-1} (x^{(i)} - \mu_1) \right)
Maximizing the Likelihood I

Log-likelihood:

\ell(D; \phi, \mu_0, \mu_1, \Sigma) = \sum_{i=1}^{n} \log p(x^{(i)} | y^{(i)}) + \sum_{i=1}^{n} \log p(y^{(i)})

Maximize with respect to \phi, \mu_0, \mu_1, \Sigma.
Results:

\phi = \frac{1}{n} \sum_{i=1}^{n} 1\{y^{(i)} = 1\}

\mu_0 = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 0\} \, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 0\}}

\mu_1 = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 1\} \, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}}



Maximizing the Likelihood II
\Sigma = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)} - \mu_{y^{(i)}}) (x^{(i)} - \mu_{y^{(i)}})^T

What the algorithm is doing can be seen as follows (figure omitted): the contours of the two Gaussian distributions that have been fit to the data in each of the two classes.
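A compact sketch of these maximum-likelihood estimates is given below, assuming a binary 0/1 label vector y and a feature matrix X of shape (n, d); fit_gda is an illustrative name, not the course's own code.

```python
import numpy as np

def fit_gda(X, y):
    """Maximum-likelihood GDA estimates: phi, mu0, mu1 and the shared Sigma."""
    n = X.shape[0]
    phi = np.mean(y == 1)                                   # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where((y == 1)[:, None], mu1, mu0)    # x^(i) - mu_{y^(i)}
    Sigma = centered.T @ centered / n                       # shared covariance matrix
    return phi, mu0, mu1, Sigma
```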
Maximizing the Likelihood III

Note that the two Gaussians have contours of the same shape and orientation, since they share the covariance matrix Σ.
They have different means µ0 and µ1.
The straight line is the decision boundary at which p(y = 1|x) = 0.5. On one side of the boundary we predict y = 1 as the most likely outcome, and on the other side we predict y = 0.



Multiclass Classification I

In multiclass classification, GDA can be extended to handle more than two classes. For K classes, the model assumes:
Each class k (where k = 1, 2, . . . , K) has its own mean vector µk.
All classes share the same covariance matrix Σ.
The class prior probabilities p(y = k) are modeled using a multinomial distribution.
The model is defined as:

y \sim \mathrm{Multinomial}(\phi_1, \phi_2, \ldots, \phi_K),
x | y = k \sim \mathcal{N}(\mu_k, \Sigma),

where \phi_k = p(y = k) and \sum_{k=1}^{K} \phi_k = 1.



Key Differences Between Binary and Multiclass GDA

Aspect             | Binary Classification        | Multiclass Classification
Number of Classes  | 2 classes (y = 0 and y = 1)  | K classes (y = 1, 2, . . . , K)
Class Prior        | Bernoulli distribution       | Multinomial distribution
Mean Vectors       | µ0 and µ1                    | µ1, µ2, . . . , µK
Covariance Matrix  | Shared Σ for both classes    | Shared Σ for all classes



GDA for Multiclass Classification

Training:
Estimate the class prior probabilities \phi_k = p(y = k) for each class.
Estimate the mean vector \mu_k for each class.
Estimate the shared covariance matrix \Sigma.
Prediction:
For a new input x, compute the posterior probability p(y = k|x) for each class using Bayes' rule:

p(y = k | x) = \frac{p(x | y = k) \, p(y = k)}{\sum_{j=1}^{K} p(x | y = j) \, p(y = j)}.

Assign the class with the highest posterior probability:

\hat{y} = \arg\max_k p(y = k | x).
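A sketch of this prediction step, assuming the parameters have already been estimated: a (K, d) array of class means mus, the shared covariance Sigma, and a (K,) array of priors (illustrative names). Terms that are constant across classes, such as the Gaussian normalization constant, are dropped because they do not affect the argmax.

```python
import numpy as np

def predict_gda_multiclass(x, mus, Sigma, priors):
    """Return argmax_k p(y=k|x) under the shared-covariance GDA model."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = [
        # log p(x|y=k) + log p(y=k), up to an additive constant shared by all classes
        -0.5 * (x - mu) @ Sigma_inv @ (x - mu) + np.log(prior)
        for mu, prior in zip(mus, priors)
    ]
    return int(np.argmax(scores))
```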



When to use GDA for Multiclass Classification

When the data for each class is approximately Gaussian.
When the number of features is not too large relative to the number of samples (to avoid overfitting).
When interpretability and probabilistic outputs are important.



Derivation of the Logistic Form in GDA I

Step 1: Assumptions of GDA

The class-conditional distributions p(x|y) are Gaussian:

p(x | y = 0) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right) \quad (1)

p(x | y = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right) \quad (2)

The class priors are modeled as:

p(y = 1) = \phi, \quad p(y = 0) = 1 - \phi \quad (3)



Derivation of the Logistic Form in GDA II

Step 2: Apply Bayes' Theorem

p(y = 1 | x) = \frac{p(x | y = 1) \, p(y = 1)}{p(x)} \quad (4)

where p(x) is the marginal distribution of x:

p(x) = p(x | y = 1) p(y = 1) + p(x | y = 0) p(y = 0).



Derivation of the Logistic Form in GDA III

Step 3: Substitute (1), (2), (3) into equation (4)

p(y = 1 | x) = \frac{1}{1 + \exp(-\theta^T x - \theta_0)},

where:

\theta = \Sigma^{-1} (\mu_1 - \mu_0),

\theta_0 = -\frac{1}{2} \left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right) + \log\left( \frac{\phi}{1 - \phi} \right).

i.e., the expression simplifies to the logistic function.

Conclusion
The posterior probability p(y = 1|x) in GDA can be expressed as a logistic function under the assumption that p(x|y) is Gaussian.
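The closed forms for θ and θ0 translate directly into code. The sketch below assumes GDA parameters have already been fit (for instance by a routine like the earlier fit_gda sketch) and uses illustrative names.

```python
import numpy as np

def gda_to_logistic(phi, mu0, mu1, Sigma):
    """Compute theta, theta0 so that p(y=1|x) = 1 / (1 + exp(-(theta @ x + theta0)))."""
    Sigma_inv = np.linalg.inv(Sigma)
    theta = Sigma_inv @ (mu1 - mu0)
    theta0 = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0) + np.log(phi / (1 - phi))
    return theta, theta0

def posterior_y1(x, theta, theta0):
    """Logistic posterior p(y = 1 | x) implied by the GDA parameters."""
    return 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))
```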



GDA vs. Logistic Regression I

The GDA model (generative algorithm) has an interesting relationship to logistic regression (discriminative algorithm).
If p(x|y) is multivariate Gaussian (with shared Σ), then p(y|x) necessarily follows a logistic function of the form

p(y = 1 | x; \phi, \Sigma, \mu_0, \mu_1) = \frac{1}{1 + \exp(-\theta^T x)},

where θ is an appropriate function of ϕ, Σ, µ0, µ1 (with x augmented by an intercept term, as on the previous slide).
The converse is not true; i.e., p(y|x) being a logistic function does not imply that p(x|y) is multivariate Gaussian.

Note
There are many different sets of assumptions that would lead to p(y|x) taking the form of a logistic function.
For example, if x|y = 0 ∼ Poisson(λ0) and x|y = 1 ∼ Poisson(λ1), then p(y|x) will be logistic.



GDA vs. Logistic Regression II

Using GDA on non-Gaussian data would give less predictable results.
This shows that GDA makes stronger modeling assumptions about the data than logistic regression does.
When these modeling assumptions are correct (p(x|y) is Gaussian), GDA is asymptotically efficient: it is the "best" estimator in the limit of very large training sets (large n), and it does better than logistic regression even for small training set sizes.
When these modeling assumptions are correct, GDA is more data efficient (i.e., requires less training data to learn "well").



GDA vs. Logistic Regression III

Logistic regression makes fewer assumptions and is more robust and less sensitive to incorrect modeling assumptions.
When the data is indeed non-Gaussian, then in the limit of large datasets, logistic regression will almost always do better than GDA.
In practice, logistic regression is often preferred due to its robustness.



Naive Bayes



Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes' theorem with the "naive" assumption of conditional independence between features.
It can use different event models depending on the type of data:
Bernoulli event model: for binary features.
Multinomial event model: for discrete count-based features.
Gaussian event model: for continuous features.
Example applications: email spam classification, sentiment analysis, medical diagnosis.



Naive Bayes Assumption

Naive Bayes (NB) assumption: the features xj are conditionally independent given the class y.
Model (the Naive Bayes classifier):

p(x_1, \ldots, x_d | y) = \prod_{j=1}^{d} p(x_j | y)

Parameters: \phi_{y_k} and \phi_{j|y=k}, for j = 1, \ldots, d and k = 1, \ldots, K.



Naive Bayes Event Models I

The choice of event model depends on the type of data.

1. Bernoulli Event Model
Used for binary features (e.g., presence or absence of words in text).
Notation:

x_j | y \sim \mathrm{Bernoulli}(\phi_{j|y})

where:
x_j is a binary feature (0 or 1).
\phi_{j|y} is the probability of x_j = 1 given class y.
Probability model:

p(x_j | y) = \phi_{j|y}^{x_j} (1 - \phi_{j|y})^{1 - x_j}

Example: email spam classification with binary word features.



Naive Bayes Event Models II

2. Multinomial Event Model
Used for discrete count-based features (e.g., word counts in text).
Notation:

x_j | y \sim \mathrm{Multinomial}(\phi_{j,1|y}, \phi_{j,2|y}, \ldots, \phi_{j,n_j|y})

where:
x_j is a discrete feature representing counts.
\phi_{j,k|y} is the probability of feature x_j taking value k given class y.
n_j is the number of possible values for feature x_j.
Probability model:

p(x_j | y) = \frac{\text{count of feature } j \text{ in class } y}{\text{total count of all features in class } y}

Example: document classification with word frequency features.



Naive Bayes Event Models III

3. Gaussian Event Model
Used for continuous features (e.g., height, weight, temperature).
Notation:

x_j | y \sim \mathcal{N}(\mu_{j|y}, \sigma_{j|y}^2)

where:
x_j is a continuous feature.
\mu_{j|y} is the mean of feature x_j given class y.
\sigma_{j|y}^2 is the variance of feature x_j given class y.
Probability model:

p(x_j | y) = \frac{1}{\sqrt{2\pi \sigma_{j|y}^2}} \exp\left( -\frac{(x_j - \mu_{j|y})^2}{2 \sigma_{j|y}^2} \right)

Example: medical diagnosis with continuous lab test results.



Maximum Likelihood Estimator (MLE)
Statistical Approach



Naive Bayes Parameter Estimation I

The Maximum Likelihood Estimator:

\theta_{\mathrm{MLE}} = \arg\max_{\theta} P(D | \theta)

Given a training set \{(x^{(i)}, y^{(i)}); i = 1, \ldots, n\}, the joint likelihood of the data D is:

L(D; \phi_{y_k}, \phi_{j|y=k}) = \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}).



Naive Bayes Parameter Estimation II

Maximum likelihood estimates depend on the event model.

Bernoulli Model:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}

\phi_{y_1} = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}}{n}
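These Bernoulli-model counts can be computed with simple boolean indexing. A sketch, assuming X is an (n, d) binary array and y holds labels in {0, ..., K-1} (illustrative names; no smoothing is applied here, so zero counts give zero probabilities, which Laplace smoothing later addresses).

```python
import numpy as np

def fit_bernoulli_nb_mle(X, y, K):
    """MLE class priors phi_y (K,) and feature probabilities phi[k, j] = p(x_j = 1 | y = k)."""
    phi_y = np.array([np.mean(y == k) for k in range(K)])
    phi_j_given_y = np.array([X[y == k].mean(axis=0) for k in range(K)])
    return phi_y, phi_j_given_y
```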



Naive Bayes Parameter Estimation III
Proof:
1. Bernoulli Model Setup:

P(x_j = 1 | y = k) = \phi_{j|y=k}.

2. Likelihood Function:

L(\phi_{j|y=k}) = \prod_{i=1}^{n} \phi_{j|y=k}^{x_j^{(i)}} (1 - \phi_{j|y=k})^{1 - x_j^{(i)}}.

3. Log-Likelihood Function:

\ell(\phi_{j|y=k}) = \sum_{i=1}^{n} \left[ x_j^{(i)} \log \phi_{j|y=k} + (1 - x_j^{(i)}) \log(1 - \phi_{j|y=k}) \right].

4. Maximizing the Log-Likelihood:

\frac{\partial \ell(\phi_{j|y=k})}{\partial \phi_{j|y=k}} = \sum_{i=1}^{n} \left[ \frac{x_j^{(i)}}{\phi_{j|y=k}} - \frac{1 - x_j^{(i)}}{1 - \phi_{j|y=k}} \right] = 0.
Naive Bayes Parameter Estimation IV

5. Multiply through by \phi_{j|y=k}(1 - \phi_{j|y=k}):

\sum_{i=1}^{n} \left[ x_j^{(i)} (1 - \phi_{j|y=k}) - (1 - x_j^{(i)}) \phi_{j|y=k} \right] = 0.

6. Expand and Simplify:

\sum_{i=1}^{n} \left[ x_j^{(i)} - x_j^{(i)} \phi_{j|y=k} - \phi_{j|y=k} + x_j^{(i)} \phi_{j|y=k} \right] = 0.

7. Cancel Terms:

\sum_{i=1}^{n} \left[ x_j^{(i)} - \phi_{j|y=k} \right] = 0.

8. Solve for \phi_{j|y=k}:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} x_j^{(i)}}{\sum_{i=1}^{n} 1}.
Naive Bayes Parameter Estimation V
9. Condition on Class y = k: restrict the sum to only those data points where y^{(i)} = k:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = k\} \, x_j^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}.

10. Final Result: since x_j^{(i)} is binary, 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\} = 1\{y^{(i)} = k\} \, x_j^{(i)}. Thus:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}.

Intuition:
The numerator counts the number of times x_j = 1 when y = k.
The denominator counts the total number of times y = k.
The ratio gives the empirical probability of x_j = 1 given y = k, which is the MLE for \phi_{j|y=k}.
Naive Bayes Parameter Estimation VI
Multinomial Model:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} x_j^{(i)} \cdot 1\{y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\} \cdot \sum_{j'=1}^{d} x_{j'}^{(i)}}

\phi_{y_k} = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}{n}

Gaussian Model:

\mu_{j|y=k} = \frac{\sum_{i=1}^{n} x_j^{(i)} \cdot 1\{y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}

\sigma_{j|y=k}^2 = \frac{\sum_{i=1}^{n} (x_j^{(i)} - \mu_{j|y=k})^2 \cdot 1\{y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}

These estimates are derived based on the type of data and the chosen event model.
Inference using Naive Bayes Classifier

To make a prediction on a new example with features x, we calculate the posterior probability:

p(y = k | x) = \frac{p(x | y = k) \, p(y = k)}{p(x)} = \frac{\left( \prod_{j=1}^{d} p(x_j | y = k) \right) p(y = k)}{\sum_{k'=1}^{K} \left( \prod_{j=1}^{d} p(x_j | y = k') \right) p(y = k')}

and pick the class with the highest posterior probability.
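A sketch of this inference step for the Bernoulli event model, done in log space for numerical stability and reusing the (hypothetical) phi_y and phi_j_given_y arrays from the earlier fitting sketch; since p(x) is the same for every class, it drops out of the argmax.

```python
import numpy as np

def predict_bernoulli_nb(x, phi_y, phi_j_given_y):
    """Return argmax_k p(y=k|x) for a binary feature vector x of shape (d,)."""
    log_post = np.log(phi_y) + np.sum(
        x * np.log(phi_j_given_y) + (1 - x) * np.log(1 - phi_j_given_y), axis=1
    )
    return int(np.argmax(log_post))
```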



Laplace Smoothing I

The maximum likelihood estimate of a discrete random variable z taking values in {1, . . . , k} is given by

\phi_j = \frac{\sum_{i=1}^{n} 1\{z^{(i)} = j\}}{n}.

Problem: zero probabilities for unseen events.
If \exists j : \phi_j = 0, then p(z_1, \ldots, z_d) = \prod_{j=1}^{d} p(z_j) = 0.
It is then not possible to make a prediction.
Solution: Laplace smoothing. Replace the above estimate with

\phi_j = \frac{1 + \sum_{i=1}^{n} 1\{z^{(i)} = j\}}{k + n}.

Now \phi_j \neq 0 for all values of j, solving our problem of probabilities being estimated as zero, even for unseen events.
Under certain (arguably quite strong) conditions, it can be shown that Laplace smoothing actually gives the optimal estimator of the \phi_j.
Laplace Smoothing II
Maximum a posteriori (smoothed) estimates depend on the event model.

Bernoulli Model:

\phi_{j|y=k} = \frac{1 + \sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{2 + \sum_{i=1}^{n} 1\{y^{(i)} = k\}}

\phi_y = \frac{1 + \sum_{i=1}^{n} 1\{y^{(i)} = 1\}}{2 + n}

Multinomial Model:

\phi_{j|y=k} = \frac{1 + \sum_{i=1}^{n} x_j^{(i)} \cdot 1\{y^{(i)} = k\}}{n_j + \sum_{i=1}^{n} 1\{y^{(i)} = k\} \cdot \sum_{j'=1}^{d} x_{j'}^{(i)}}

\phi_{y_k} = \frac{1 + \sum_{i=1}^{n} 1\{y^{(i)} = k\}}{K + n}
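The smoothed Bernoulli-model estimates translate into the sketch below (illustrative names; the K + n prior denominator reduces to the slide's 2 + n when there are two classes).

```python
import numpy as np

def fit_bernoulli_nb_laplace(X, y, K):
    """Laplace-smoothed estimates: no probability is ever exactly zero."""
    n, _ = X.shape
    phi_y = np.array([(1 + np.sum(y == k)) / (K + n) for k in range(K)])
    phi_j_given_y = np.array(
        [(1 + X[y == k].sum(axis=0)) / (2 + np.sum(y == k)) for k in range(K)]
    )
    return phi_y, phi_j_given_y
```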



Bayesian Approach for Parameter Estimation



Maximum A Posteriori (MAP) Estimator I

Bayesian Framework:
The MAP estimator maximizes the posterior distribution:

\theta_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta | D) = \arg\max_{\theta} P(D | \theta) P(\theta).

MAP for the Bernoulli Model:
For the Bernoulli model, the parameter \phi_{j|y=k} represents the probability of feature x_j = 1 given class y = k.
Likelihood:

P(D | \phi_{j|y=k}) = \prod_{i=1}^{n} \phi_{j|y=k}^{x_j^{(i)}} (1 - \phi_{j|y=k})^{1 - x_j^{(i)}}.



Maximum A Posteriori (MAP) Estimator II

Prior (Beta Distribution):

P(\phi_{j|y=k}) = \frac{\phi_{j|y=k}^{\alpha - 1} (1 - \phi_{j|y=k})^{\beta - 1}}{B(\alpha, \beta)},

where:
\alpha and \beta are the hyperparameters of the Beta prior.
B(\cdot, \cdot) is the Beta function:

B(a, b) = \int_0^1 t^{a-1} (1 - t)^{b-1} \, dt.

Posterior:

P(\phi_{j|y=k} | D) \propto \phi_{j|y=k}^{\sum_{i=1}^{n} x_j^{(i)} + \alpha - 1} (1 - \phi_{j|y=k})^{\sum_{i=1}^{n} (1 - x_j^{(i)}) + \beta - 1}.



Maximum A Posteriori (MAP) Estimator III

MAP Estimate:

\phi_{j|y=k}^{\mathrm{MAP}} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\} + \alpha - 1}{\sum_{i=1}^{n} 1\{y^{(i)} = k\} + \alpha + \beta - 2}.

Intuition:
The MAP estimate combines the observed data (likelihood) with prior knowledge (prior).
The hyperparameters \alpha and \beta act as "pseudo-counts" to smooth the estimate.
When \alpha = \beta = 1, the Beta prior is uniform, and the MAP estimate reduces to the MLE.
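A sketch of this MAP estimate for one class, assuming binary features X of shape (n, d) and labels y (illustrative names); with \alpha = \beta = 2 the formula gives add-one (Laplace) pseudo-counts, while \alpha = \beta = 1 recovers the plain MLE.

```python
import numpy as np

def map_bernoulli_estimate(X, y, k, alpha=2.0, beta=2.0):
    """Return the (d,) vector of MAP estimates phi_{j|y=k} under a Beta(alpha, beta) prior."""
    in_class = (y == k)
    num = X[in_class].sum(axis=0) + alpha - 1   # counts of x_j = 1 within class k, plus prior
    den = in_class.sum() + alpha + beta - 2     # number of class-k examples, plus prior
    return num / den
```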



Expectation A Posteriori (EAP) Estimator I

Bayesian Framework:
The EAP estimator computes the expected value of the posterior distribution:

\theta_{\mathrm{EAP}} = E[\theta | D] = \int \theta \cdot P(\theta | D) \, d\theta.

EAP for the Bernoulli Model:
For the Bernoulli model, the parameter \phi_{j|y=k} represents the probability of feature x_j = 1 given class y = k.
Likelihood:

P(D | \phi_{j|y=k}) = \prod_{i=1}^{n} \phi_{j|y=k}^{x_j^{(i)}} (1 - \phi_{j|y=k})^{1 - x_j^{(i)}}.



Expectation A Posteriori (EAP) Estimator II

Prior (Beta Distribution):

P(\phi_{j|y=k}) = \frac{\phi_{j|y=k}^{\alpha - 1} (1 - \phi_{j|y=k})^{\beta - 1}}{B(\alpha, \beta)}.

Posterior: the posterior distribution P(\phi_{j|y=k} | D) is a Beta distribution:

\phi_{j|y=k} \mid D \sim \mathrm{Beta}\left( \alpha + \sum_{i=1}^{n} x_j^{(i)}, \; \beta + \sum_{i=1}^{n} (1 - x_j^{(i)}) \right)

i.e.

P(\phi_{j|y=k} | D) = \frac{\phi_{j|y=k}^{\alpha + \sum_{i=1}^{n} x_j^{(i)} - 1} \, (1 - \phi_{j|y=k})^{\beta + \sum_{i=1}^{n} (1 - x_j^{(i)}) - 1}}{B\left( \alpha + \sum_{i=1}^{n} x_j^{(i)}, \; \beta + \sum_{i=1}^{n} (1 - x_j^{(i)}) \right)}



Expectation A Posteriori (EAP) Estimator III

EAP Estimate: the EAP estimate is the expected value of the posterior:

\phi_{j|y=k}^{\mathrm{EAP}} = \frac{\alpha + \sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{\alpha + \beta + \sum_{i=1}^{n} 1\{y^{(i)} = k\}}.

Intuition:
The EAP estimate averages over the entire posterior distribution, providing a more robust estimate than the MAP.
The hyperparameters \alpha and \beta act as "pseudo-counts" to smooth the estimate.
When \alpha = \beta = 1, the Beta prior is uniform, and the EAP estimate reduces to the MLE with Laplace smoothing.
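A tiny numeric check of the three estimators for a single feature and class, using made-up counts (3 occurrences of x_j = 1 among 10 class-k examples) and a uniform Beta(1, 1) prior:

```python
count_xj1, count_k = 3, 10      # illustrative counts, not from the slides
alpha = beta = 1.0              # uniform Beta prior

mle = count_xj1 / count_k                                          # 0.300
map_est = (count_xj1 + alpha - 1) / (count_k + alpha + beta - 2)   # 0.300, equals the MLE
eap = (alpha + count_xj1) / (alpha + beta + count_k)               # 0.333..., equals Laplace smoothing
print(mle, map_est, eap)
```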



Beta and Dirichlet Distributions



Beta Distribution I

The Beta distribution is a continuous probability distribution defined on [0, 1] with PDF:

P(x | \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)},

where:
x \in [0, 1],
\alpha > 0 and \beta > 0 are the shape parameters,
B(\alpha, \beta) is the Beta function:

B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} \, dt.



Beta Distribution II

Key Properties:
Mean: E[x] = \frac{\alpha}{\alpha + \beta}.
Variance: \mathrm{Var}(x) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.
Conjugate prior for the Bernoulli and Binomial distributions, i.e. if the prior is Beta and the likelihood is Bernoulli/Binomial, the posterior is also Beta.
Applications:
Modeling probabilities of binary events (e.g., success/failure, heads/tails).
Bayesian inference for proportions.



Dirichlet Distribution I

The Dirichlet distribution is a multivariate generalization of the Beta distribution with PDF:

P(x | \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} x_k^{\alpha_k - 1},

where:
x = (x_1, x_2, \ldots, x_K) is a probability vector (\sum_{k=1}^{K} x_k = 1); x lies on the (K - 1)-dimensional simplex,
\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K) are the shape parameters,



Dirichlet Distribution II

B(\alpha) is the multivariate Beta function:

B(\alpha) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left( \sum_{k=1}^{K} \alpha_k \right)},

where \Gamma(z) is the Gamma function, a generalization of the factorial function to real and complex numbers z (with \mathrm{Re}(z) > 0):

\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t} \, dt.



Dirichlet Distribution III

Key Properties:
Mean: E[x_k] = \frac{\alpha_k}{\sum_{i=1}^{K} \alpha_i}.
Variance: \mathrm{Var}(x_k) = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0 + 1)}, where \alpha_0 = \sum_{i=1}^{K} \alpha_i.
Mode: \mathrm{Mode}(x_k) = \frac{\alpha_k - 1}{\sum_{j=1}^{K} \alpha_j - K} for \alpha_k > 1.
Conjugate prior for the Multinomial distribution, i.e. if the prior is Dirichlet and the likelihood is Multinomial, the posterior is also Dirichlet.
Applications:
Modeling probabilities over multiple categories (e.g., dice rolls, topic modeling).
Bayesian inference for categorical data.



Relationship Between Beta and Dirichlet Distributions

The Beta distribution is a special case of the Dirichlet distribution when K = 2:

P(x_1, x_2 | \alpha_1, \alpha_2) = \frac{x_1^{\alpha_1 - 1} x_2^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)}.

Letting x_1 = x and x_2 = 1 - x, this reduces to the Beta distribution:

P(x | \alpha_1, \alpha_2) = \frac{x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)}.



Conclusion

Generative learning algorithms model p(x|y) and p(y).
GDA assumes a Gaussian distribution for p(x|y).
Naive Bayes assumes conditional independence of the features.
Laplace smoothing helps avoid zero probabilities.
Even though the Naive Bayes assumption is an extremely strong assumption, the NB algorithm works well on many problems.
It is simple, fast, and performs surprisingly well in practice, especially for text classification and other high-dimensional datasets.

