
Machine Learning

Generative Learning Algorithms

Mohamed Farah

ISAMM

2024-2025



Outline

1 Introduction to Generative Learning Algorithms

2 Gaussian Discriminant Analysis (GDA)

3 Naive Bayes



Introduction to Generative Learning Algorithms



Generative vs. Discriminative Learning I

Discriminative Algorithms: Model p(y|x; θ) directly.
Focus on finding a decision boundary to separate classes.
Examples: Logistic Regression, Perceptron.



Generative vs. Discriminative Learning II

Generative Algorithms: Model p(x|y) and p(y), then derive the posterior distribution on y given x using Bayes' rule:

p(y|x) = \frac{p(x|y)\, p(y)}{p(x)}.



Generative vs. Discriminative Learning III

Focus on modeling the distribution of each class, p(x|y), as well as p(y) (called the class priors).
In order to make a prediction, choose

y^* = \arg\max_y p(y|x) = \arg\max_y \frac{p(x|y)\, p(y)}{p(x)} = \arg\max_y p(x|y)\, p(y).

We do not actually need to calculate the denominator p(x).
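As a minimal illustration of this decision rule, the sketch below scores each class by p(x|y = k) p(y = k) and returns the argmax; the names predict_generative, class_likelihood, and priors are illustrative placeholders, not from the slides.

```python
import numpy as np

def predict_generative(x, class_likelihood, priors):
    """Pick the class maximizing p(x|y=k) * p(y=k); the evidence p(x) is not needed."""
    # class_likelihood(x, k) is assumed to return p(x | y = k) for class k.
    scores = [class_likelihood(x, k) * priors[k] for k in range(len(priors))]
    return int(np.argmax(scores))
```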



Gaussian Discriminant Analysis (GDA)



Multivariate Normal Distribution I
Also called the multivariate Gaussian distribution.
Density function:

p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

Mean:

E[X] = \int_x x \, p(x; \mu, \Sigma) \, dx = \mu, \qquad \mu \in \mathbb{R}^d

Covariance of a vector-valued random variable X:

\mathrm{Cov}(X) = E[(X - E[X])(X - E[X])^T] = E[X X^T] - (E[X])(E[X])^T = \Sigma

\Sigma \in \mathbb{R}^{d \times d}, where \Sigma \succeq 0 is symmetric and positive semi-definite.
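The density formula above can be checked with a few lines of NumPy. This is only a sketch of the formula, not code from the slides, and mvn_pdf is an illustrative name.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x; mu, Sigma) for x, mu in R^d."""
    d = mu.shape[0]
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return float(np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const)

# Standard bivariate Gaussian evaluated at its mean: 1 / (2*pi), about 0.159.
print(mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```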
Multivariate Normal Distribution II

Examples of the density of a Gaussian distribution (figure omitted): three densities with \mu = (0, 0)^T and \Sigma = I, \Sigma = 0.6I, and \Sigma = 2I.

As Σ becomes larger, the Gaussian becomes more “spread out,” and as it becomes smaller, the distribution becomes more “compressed.”



Multivariate Normal Distribution III

Examples of the density of a Gaussian distribution (figure omitted): three densities with \mu = (0, 0)^T and \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}.

As we increase the off-diagonal entry in Σ, the density becomes more “compressed” towards the 45° line (given by x1 = x2).



Multivariate Normal Distribution IV
Contour plots (figure omitted): the contours of the same three densities as on the previous slide (\mu = (0, 0)^T with off-diagonal entries 0, 0.5 and 0.8).


Gaussian Discriminant Analysis (GDA) I
We have a classification problem.
Start with binary classification.
The input features x are continuous-valued random variables.
We assume p(x|y) is a multivariate normal distribution.
Model:

y \sim \mathrm{Bernoulli}(\phi)
x | y = 0 \sim \mathcal{N}(\mu_0, \Sigma)
x | y = 1 \sim \mathcal{N}(\mu_1, \Sigma)

Parameters: \phi, \mu_0, \mu_1 and \Sigma.
Distributions:

p(x | y = 0) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right),

p(x | y = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right).
Deriving the GDA Model
Goal: Estimate the parameters \phi, \mu_0, \mu_1, \Sigma using maximum likelihood.
Likelihood of the data D:

L(D; \phi, \mu_0, \mu_1, \Sigma) = \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma)

Decompose into class-conditional probabilities:

p(x^{(i)}, y^{(i)}) = p(x^{(i)} | y^{(i)}) \, p(y^{(i)})

For y^{(i)} = 0:

p(x^{(i)} | y^{(i)} = 0) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_0)^T \Sigma^{-1} (x^{(i)} - \mu_0) \right)

For y^{(i)} = 1:

p(x^{(i)} | y^{(i)} = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_1)^T \Sigma^{-1} (x^{(i)} - \mu_1) \right)
Maximizing the Likelihood I

Log-likelihood:

\ell(D; \phi, \mu_0, \mu_1, \Sigma) = \sum_{i=1}^{n} \log p(x^{(i)} | y^{(i)}) + \sum_{i=1}^{n} \log p(y^{(i)})

Maximize with respect to \phi, \mu_0, \mu_1, \Sigma.
Results:

\phi = \frac{1}{n} \sum_{i=1}^{n} 1\{y^{(i)} = 1\}

\mu_0 = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 0\} \, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 0\}}

\mu_1 = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 1\} \, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}}



Maximizing the Likelihood II
\Sigma = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)} - \mu_{y^{(i)}}) (x^{(i)} - \mu_{y^{(i)}})^T

What the algorithm is doing can be seen as follows (figure omitted): the contours of the two Gaussian distributions that have been fit to the data in each of the two classes.
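A compact sketch of these maximum-likelihood estimates is given below, assuming a binary 0/1 label vector y and a feature matrix X of shape (n, d); fit_gda is an illustrative name, not the course's own code.

```python
import numpy as np

def fit_gda(X, y):
    """Maximum-likelihood GDA estimates: phi, mu0, mu1 and the shared Sigma."""
    n = X.shape[0]
    phi = np.mean(y == 1)                                   # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where((y == 1)[:, None], mu1, mu0)    # x^(i) - mu_{y^(i)}
    Sigma = centered.T @ centered / n                       # shared covariance matrix
    return phi, mu0, mu1, Sigma
```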
Maximizing the Likelihood III

Note that the two Gaussians have contours of the same shape and orientation, since they share the covariance matrix Σ.
They have different means µ0 and µ1.
The straight line is the decision boundary at which p(y = 1|x) = 0.5. On one side of the boundary we predict y = 1 as the most likely outcome, and on the other side we predict y = 0.



Multiclass Classification I

In multiclass classification, GDA can be extended to handle more than two classes. For K classes, the model assumes:
Each class k (where k = 1, 2, . . . , K) has its own mean vector µk.
All classes share the same covariance matrix Σ.
The class prior probabilities p(y = k) are modeled using a multinomial distribution.
The model is defined as:

y \sim \mathrm{Multinomial}(\phi_1, \phi_2, \ldots, \phi_K),
x | y = k \sim \mathcal{N}(\mu_k, \Sigma),

where \phi_k = p(y = k) and \sum_{k=1}^{K} \phi_k = 1.



Key Differences Between Binary and Multiclass GDA

Aspect             | Binary Classification        | Multiclass Classification
Number of Classes  | 2 classes (y = 0 and y = 1)  | K classes (y = 1, 2, . . . , K)
Class Prior        | Bernoulli distribution       | Multinomial distribution
Mean Vectors       | µ0 and µ1                    | µ1, µ2, . . . , µK
Covariance Matrix  | Shared Σ for both classes    | Shared Σ for all classes



GDA for Multiclass Classification

Training:
Estimate the class prior probabilities \phi_k = p(y = k) for each class.
Estimate the mean vector \mu_k for each class.
Estimate the shared covariance matrix \Sigma.
Prediction:
For a new input x, compute the posterior probability p(y = k|x) for each class using Bayes' rule:

p(y = k | x) = \frac{p(x | y = k) \, p(y = k)}{\sum_{j=1}^{K} p(x | y = j) \, p(y = j)}.

Assign the class with the highest posterior probability:

\hat{y} = \arg\max_k p(y = k | x).
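A sketch of this prediction step, assuming the parameters have already been estimated: a (K, d) array of class means mus, the shared covariance Sigma, and a (K,) array of priors (illustrative names). Terms that are constant across classes, such as the Gaussian normalization constant, are dropped because they do not affect the argmax.

```python
import numpy as np

def predict_gda_multiclass(x, mus, Sigma, priors):
    """Return argmax_k p(y=k|x) under the shared-covariance GDA model."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = [
        # log p(x|y=k) + log p(y=k), up to an additive constant shared by all classes
        -0.5 * (x - mu) @ Sigma_inv @ (x - mu) + np.log(prior)
        for mu, prior in zip(mus, priors)
    ]
    return int(np.argmax(scores))
```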



When to use GDA for Multiclass Classification

When the data for each class is approximately Gaussian.
When the number of features is not too large relative to the number of samples (to avoid overfitting).
When interpretability and probabilistic outputs are important.



Derivation of the Logistic Form in GDA I

Step 1: Assumptions of GDA

The class-conditional distributions p(x|y) are Gaussian:

p(x | y = 0) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right) \quad (1)

p(x | y = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right) \quad (2)

The class priors are modeled as:

p(y = 1) = \phi, \quad p(y = 0) = 1 - \phi \quad (3)



Derivation of the Logistic Form in GDA II

Step 2: Apply Bayes' Theorem

p(y = 1 | x) = \frac{p(x | y = 1) \, p(y = 1)}{p(x)} \quad (4)

where p(x) is the marginal distribution of x:

p(x) = p(x | y = 1) p(y = 1) + p(x | y = 0) p(y = 0).



Derivation of the Logistic Form in GDA III

Step 3: Substitute (1), (2), (3) into equation (4)

p(y = 1 | x) = \frac{1}{1 + \exp(-\theta^T x - \theta_0)},

where:

\theta = \Sigma^{-1} (\mu_1 - \mu_0),

\theta_0 = -\frac{1}{2} \left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right) + \log\left( \frac{\phi}{1 - \phi} \right).

i.e., the expression simplifies to the logistic function.

Conclusion
The posterior probability p(y = 1|x) in GDA can be expressed as a logistic function under the assumption that p(x|y) is Gaussian.
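The closed forms for θ and θ0 translate directly into code. The sketch below assumes GDA parameters have already been fit (for instance by a routine like the earlier fit_gda sketch) and uses illustrative names.

```python
import numpy as np

def gda_to_logistic(phi, mu0, mu1, Sigma):
    """Compute theta, theta0 so that p(y=1|x) = 1 / (1 + exp(-(theta @ x + theta0)))."""
    Sigma_inv = np.linalg.inv(Sigma)
    theta = Sigma_inv @ (mu1 - mu0)
    theta0 = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0) + np.log(phi / (1 - phi))
    return theta, theta0

def posterior_y1(x, theta, theta0):
    """Logistic posterior p(y = 1 | x) implied by the GDA parameters."""
    return 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))
```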



GDA vs. Logistic Regression I

The GDA model (generative algorithm) has an interesting relationship to logistic regression (discriminative algorithm).
If p(x|y) is multivariate Gaussian (with shared Σ), then p(y|x) necessarily follows a logistic function of the form

p(y = 1 | x; \phi, \Sigma, \mu_0, \mu_1) = \frac{1}{1 + \exp(-\theta^T x)},

where θ is an appropriate function of ϕ, Σ, µ0, µ1 (with x augmented by an intercept term, as on the previous slide).
The converse is not true; i.e., p(y|x) being a logistic function does not imply that p(x|y) is multivariate Gaussian.

Note
There are many different sets of assumptions that would lead to p(y|x) taking the form of a logistic function.
For example, if x|y = 0 ∼ Poisson(λ0) and x|y = 1 ∼ Poisson(λ1), then p(y|x) will be logistic.



GDA vs. Logistic Regression II

Using GDA on non-Gaussian data would give less predictable results.
This shows that GDA makes stronger modeling assumptions about the data than logistic regression does.
When these modeling assumptions are correct (p(x|y) is Gaussian), GDA is asymptotically efficient: it is the "best" estimator in the limit of very large training sets (large n), and it does better than logistic regression even for small training set sizes.
When these modeling assumptions are correct, GDA is more data efficient (i.e., requires less training data to learn "well").



GDA vs. Logistic Regression III

Logistic regression makes fewer assumptions and is more robust and less sensitive to incorrect modeling assumptions.
When the data is indeed non-Gaussian, then in the limit of large datasets, logistic regression will almost always do better than GDA.
In practice, logistic regression is often preferred due to its robustness.



Naive Bayes



Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes' theorem with the "naive" assumption of conditional independence between features.
It can use different event models depending on the type of data:
Bernoulli event model: for binary features.
Multinomial event model: for discrete count-based features.
Gaussian event model: for continuous features.
Example applications: email spam classification, sentiment analysis, medical diagnosis.



Naive Bayes Assumption

Naive Bayes (NB) assumption: the features xj are conditionally independent given the class y.
Model (the Naive Bayes classifier):

p(x_1, \ldots, x_d | y) = \prod_{j=1}^{d} p(x_j | y)

Parameters: \phi_{y_k} and \phi_{j|y=k}, for j = 1, \ldots, d and k = 1, \ldots, K.



Naive Bayes Event Models I

The choice of event model depends on the type of data.

1. Bernoulli Event Model
Used for binary features (e.g., presence or absence of words in text).
Notation:

x_j | y \sim \mathrm{Bernoulli}(\phi_{j|y})

where:
x_j is a binary feature (0 or 1).
\phi_{j|y} is the probability of x_j = 1 given class y.
Probability model:

p(x_j | y) = \phi_{j|y}^{x_j} (1 - \phi_{j|y})^{1 - x_j}

Example: email spam classification with binary word features.



Naive Bayes Event Models II

2. Multinomial Event Model
Used for discrete count-based features (e.g., word counts in text).
Notation:

x_j | y \sim \mathrm{Multinomial}(\phi_{j,1|y}, \phi_{j,2|y}, \ldots, \phi_{j,n_j|y})

where:
x_j is a discrete feature representing counts.
\phi_{j,k|y} is the probability of feature x_j taking value k given class y.
n_j is the number of possible values for feature x_j.
Probability model:

p(x_j | y) = \frac{\text{count of feature } j \text{ in class } y}{\text{total count of all features in class } y}

Example: document classification with word frequency features.



Naive Bayes Event Models III

3. Gaussian Event Model
Used for continuous features (e.g., height, weight, temperature).
Notation:

x_j | y \sim \mathcal{N}(\mu_{j|y}, \sigma_{j|y}^2)

where:
x_j is a continuous feature.
\mu_{j|y} is the mean of feature x_j given class y.
\sigma_{j|y}^2 is the variance of feature x_j given class y.
Probability model:

p(x_j | y) = \frac{1}{\sqrt{2\pi \sigma_{j|y}^2}} \exp\left( -\frac{(x_j - \mu_{j|y})^2}{2 \sigma_{j|y}^2} \right)

Example: medical diagnosis with continuous lab test results.



Maximum Likelihood Estimator (MLE)
Statistical Approach



Naive Bayes Parameter Estimation I

The Maximum Likelihood Estimator:

\theta_{\mathrm{MLE}} = \arg\max_{\theta} P(D | \theta)

Given a training set \{(x^{(i)}, y^{(i)}); i = 1, \ldots, n\}, the joint likelihood of the data D is:

L(D; \phi_{y_k}, \phi_{j|y=k}) = \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}).



Naive Bayes Parameter Estimation II

Maximum likelihood estimates depend on the event model.

Bernoulli Model:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}

\phi_{y_1} = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}}{n}
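These Bernoulli-model counts can be computed with simple boolean indexing. A sketch, assuming X is an (n, d) binary array and y holds labels in {0, ..., K-1} (illustrative names; no smoothing is applied here, so zero counts give zero probabilities, which Laplace smoothing later addresses).

```python
import numpy as np

def fit_bernoulli_nb_mle(X, y, K):
    """MLE class priors phi_y (K,) and feature probabilities phi[k, j] = p(x_j = 1 | y = k)."""
    phi_y = np.array([np.mean(y == k) for k in range(K)])
    phi_j_given_y = np.array([X[y == k].mean(axis=0) for k in range(K)])
    return phi_y, phi_j_given_y
```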



Naive Bayes Parameter Estimation III
Proof:
1. Bernoulli Model Setup:

P(x_j = 1 | y = k) = \phi_{j|y=k}.

2. Likelihood Function:

L(\phi_{j|y=k}) = \prod_{i=1}^{n} \phi_{j|y=k}^{x_j^{(i)}} (1 - \phi_{j|y=k})^{1 - x_j^{(i)}}.

3. Log-Likelihood Function:

\ell(\phi_{j|y=k}) = \sum_{i=1}^{n} \left[ x_j^{(i)} \log \phi_{j|y=k} + (1 - x_j^{(i)}) \log(1 - \phi_{j|y=k}) \right].

4. Maximizing the Log-Likelihood:

\frac{\partial \ell(\phi_{j|y=k})}{\partial \phi_{j|y=k}} = \sum_{i=1}^{n} \left[ \frac{x_j^{(i)}}{\phi_{j|y=k}} - \frac{1 - x_j^{(i)}}{1 - \phi_{j|y=k}} \right] = 0.
Naive Bayes Parameter Estimation IV

5. Multiply through by \phi_{j|y=k}(1 - \phi_{j|y=k}):

\sum_{i=1}^{n} \left[ x_j^{(i)} (1 - \phi_{j|y=k}) - (1 - x_j^{(i)}) \phi_{j|y=k} \right] = 0.

6. Expand and Simplify:

\sum_{i=1}^{n} \left[ x_j^{(i)} - x_j^{(i)} \phi_{j|y=k} - \phi_{j|y=k} + x_j^{(i)} \phi_{j|y=k} \right] = 0.

7. Cancel Terms:

\sum_{i=1}^{n} \left[ x_j^{(i)} - \phi_{j|y=k} \right] = 0.

8. Solve for \phi_{j|y=k}:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} x_j^{(i)}}{\sum_{i=1}^{n} 1}.
Naive Bayes Parameter Estimation V
9. Condition on Class y = k: restrict the sum to only those data points where y^{(i)} = k:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = k\} \, x_j^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}.

10. Final Result: since x_j^{(i)} is binary, 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\} = 1\{y^{(i)} = k\} \, x_j^{(i)}. Thus:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}.

Intuition:
The numerator counts the number of times x_j = 1 when y = k.
The denominator counts the total number of times y = k.
The ratio gives the empirical probability of x_j = 1 given y = k, which is the MLE for \phi_{j|y=k}.
Naive Bayes Parameter Estimation VI
Multinomial Model:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} x_j^{(i)} \cdot 1\{y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\} \cdot \sum_{j'=1}^{d} x_{j'}^{(i)}}

\phi_{y_k} = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}{n}

Gaussian Model:

\mu_{j|y=k} = \frac{\sum_{i=1}^{n} x_j^{(i)} \cdot 1\{y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}

\sigma_{j|y=k}^2 = \frac{\sum_{i=1}^{n} (x_j^{(i)} - \mu_{j|y=k})^2 \cdot 1\{y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}

These estimates are derived based on the type of data and the chosen event model.
Inference using Naive Bayes Classifier

To make a prediction on a new example with features x, we calculate the posterior probability:

p(y = k | x) = \frac{p(x | y = k) \, p(y = k)}{p(x)} = \frac{\left( \prod_{j=1}^{d} p(x_j | y = k) \right) p(y = k)}{\sum_{k'=1}^{K} \left( \prod_{j=1}^{d} p(x_j | y = k') \right) p(y = k')}

and pick the class with the highest posterior probability.
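A sketch of this inference step for the Bernoulli event model, done in log space for numerical stability and reusing the (hypothetical) phi_y and phi_j_given_y arrays from the earlier fitting sketch; since p(x) is the same for every class, it drops out of the argmax.

```python
import numpy as np

def predict_bernoulli_nb(x, phi_y, phi_j_given_y):
    """Return argmax_k p(y=k|x) for a binary feature vector x of shape (d,)."""
    log_post = np.log(phi_y) + np.sum(
        x * np.log(phi_j_given_y) + (1 - x) * np.log(1 - phi_j_given_y), axis=1
    )
    return int(np.argmax(log_post))
```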



Laplace Smoothing I

The maximum likelihood estimate of a discrete random variable z taking values in {1, . . . , k} is given by

\phi_j = \frac{\sum_{i=1}^{n} 1\{z^{(i)} = j\}}{n}.

Problem: zero probabilities for unseen events.
If \exists j : \phi_j = 0, then p(z_1, \ldots, z_d) = \prod_{j=1}^{d} p(z_j) = 0.
It is then not possible to make a prediction.
Solution: Laplace smoothing. Replace the above estimate with

\phi_j = \frac{1 + \sum_{i=1}^{n} 1\{z^{(i)} = j\}}{k + n}.

Now \phi_j \neq 0 for all values of j, solving our problem of probabilities being estimated as zero, even for unseen events.
Under certain (arguably quite strong) conditions, it can be shown that Laplace smoothing actually gives the optimal estimator of the \phi_j.
Laplace Smoothing II
Maximum a posteriori (smoothed) estimates depend on the event model.

Bernoulli Model:

\phi_{j|y=k} = \frac{1 + \sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{2 + \sum_{i=1}^{n} 1\{y^{(i)} = k\}}

\phi_y = \frac{1 + \sum_{i=1}^{n} 1\{y^{(i)} = 1\}}{2 + n}

Multinomial Model:

\phi_{j|y=k} = \frac{1 + \sum_{i=1}^{n} x_j^{(i)} \cdot 1\{y^{(i)} = k\}}{n_j + \sum_{i=1}^{n} 1\{y^{(i)} = k\} \cdot \sum_{j'=1}^{d} x_{j'}^{(i)}}

\phi_{y_k} = \frac{1 + \sum_{i=1}^{n} 1\{y^{(i)} = k\}}{K + n}
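The smoothed Bernoulli-model estimates translate into the sketch below (illustrative names; the K + n prior denominator reduces to the slide's 2 + n when there are two classes).

```python
import numpy as np

def fit_bernoulli_nb_laplace(X, y, K):
    """Laplace-smoothed estimates: no probability is ever exactly zero."""
    n, _ = X.shape
    phi_y = np.array([(1 + np.sum(y == k)) / (K + n) for k in range(K)])
    phi_j_given_y = np.array(
        [(1 + X[y == k].sum(axis=0)) / (2 + np.sum(y == k)) for k in range(K)]
    )
    return phi_y, phi_j_given_y
```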



Bayesian Approach for Parameter Estimation



Maximum A Posteriori (MAP) Estimator I

Bayesian Framework:
The MAP estimator maximizes the posterior distribution:

\theta_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta | D) = \arg\max_{\theta} P(D | \theta) P(\theta).

MAP for the Bernoulli Model:
For the Bernoulli model, the parameter \phi_{j|y=k} represents the probability of feature x_j = 1 given class y = k.
Likelihood:

P(D | \phi_{j|y=k}) = \prod_{i=1}^{n} \phi_{j|y=k}^{x_j^{(i)}} (1 - \phi_{j|y=k})^{1 - x_j^{(i)}}.



Maximum A Posteriori (MAP) Estimator II

Prior (Beta Distribution):

P(\phi_{j|y=k}) = \frac{\phi_{j|y=k}^{\alpha - 1} (1 - \phi_{j|y=k})^{\beta - 1}}{B(\alpha, \beta)},

where:
\alpha and \beta are the hyperparameters of the Beta prior.
B(\cdot, \cdot) is the Beta function:

B(a, b) = \int_0^1 t^{a-1} (1 - t)^{b-1} \, dt.

Posterior:

P(\phi_{j|y=k} | D) \propto \phi_{j|y=k}^{\sum_{i=1}^{n} x_j^{(i)} + \alpha - 1} (1 - \phi_{j|y=k})^{\sum_{i=1}^{n} (1 - x_j^{(i)}) + \beta - 1}.



Maximum A Posteriori (MAP) Estimator III

MAP Estimate:

\phi_{j|y=k}^{\mathrm{MAP}} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\} + \alpha - 1}{\sum_{i=1}^{n} 1\{y^{(i)} = k\} + \alpha + \beta - 2}.

Intuition:
The MAP estimate combines the observed data (likelihood) with prior knowledge (prior).
The hyperparameters \alpha and \beta act as "pseudo-counts" to smooth the estimate.
When \alpha = \beta = 1, the Beta prior is uniform, and the MAP estimate reduces to the MLE.
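A sketch of this MAP estimate for one class, assuming binary features X of shape (n, d) and labels y (illustrative names); with \alpha = \beta = 2 the formula gives add-one (Laplace) pseudo-counts, while \alpha = \beta = 1 recovers the plain MLE.

```python
import numpy as np

def map_bernoulli_estimate(X, y, k, alpha=2.0, beta=2.0):
    """Return the (d,) vector of MAP estimates phi_{j|y=k} under a Beta(alpha, beta) prior."""
    in_class = (y == k)
    num = X[in_class].sum(axis=0) + alpha - 1   # counts of x_j = 1 within class k, plus prior
    den = in_class.sum() + alpha + beta - 2     # number of class-k examples, plus prior
    return num / den
```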



Expectation A Posteriori (EAP) Estimator I

Bayesian Framework:
The EAP estimator computes the expected value of the posterior distribution:

\theta_{\mathrm{EAP}} = E[\theta | D] = \int \theta \cdot P(\theta | D) \, d\theta.

EAP for the Bernoulli Model:
For the Bernoulli model, the parameter \phi_{j|y=k} represents the probability of feature x_j = 1 given class y = k.
Likelihood:

P(D | \phi_{j|y=k}) = \prod_{i=1}^{n} \phi_{j|y=k}^{x_j^{(i)}} (1 - \phi_{j|y=k})^{1 - x_j^{(i)}}.



Expectation A Posteriori (EAP) Estimator II

Prior (Beta Distribution):

P(\phi_{j|y=k}) = \frac{\phi_{j|y=k}^{\alpha - 1} (1 - \phi_{j|y=k})^{\beta - 1}}{B(\alpha, \beta)}.

Posterior: the posterior distribution P(\phi_{j|y=k} | D) is a Beta distribution:

\phi_{j|y=k} \mid D \sim \mathrm{Beta}\left( \alpha + \sum_{i=1}^{n} x_j^{(i)}, \; \beta + \sum_{i=1}^{n} (1 - x_j^{(i)}) \right)

i.e.

P(\phi_{j|y=k} | D) = \frac{\phi_{j|y=k}^{\alpha + \sum_{i=1}^{n} x_j^{(i)} - 1} \, (1 - \phi_{j|y=k})^{\beta + \sum_{i=1}^{n} (1 - x_j^{(i)}) - 1}}{B\left( \alpha + \sum_{i=1}^{n} x_j^{(i)}, \; \beta + \sum_{i=1}^{n} (1 - x_j^{(i)}) \right)}



Expectation A Posteriori (EAP) Estimator III

EAP Estimate: the EAP estimate is the expected value of the posterior:

\phi_{j|y=k}^{\mathrm{EAP}} = \frac{\alpha + \sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{\alpha + \beta + \sum_{i=1}^{n} 1\{y^{(i)} = k\}}.

Intuition:
The EAP estimate averages over the entire posterior distribution, providing a more robust estimate than the MAP.
The hyperparameters \alpha and \beta act as "pseudo-counts" to smooth the estimate.
When \alpha = \beta = 1, the Beta prior is uniform, and the EAP estimate reduces to the MLE with Laplace smoothing.
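A tiny numeric check of the three estimators for a single feature and class, using made-up counts (3 occurrences of x_j = 1 among 10 class-k examples) and a uniform Beta(1, 1) prior:

```python
count_xj1, count_k = 3, 10      # illustrative counts, not from the slides
alpha = beta = 1.0              # uniform Beta prior

mle = count_xj1 / count_k                                          # 0.300
map_est = (count_xj1 + alpha - 1) / (count_k + alpha + beta - 2)   # 0.300, equals the MLE
eap = (alpha + count_xj1) / (alpha + beta + count_k)               # 0.333..., equals Laplace smoothing
print(mle, map_est, eap)
```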



Beta and Dirichlet Distributions



Beta Distribution I

The Beta distribution is a continuous probability distribution defined on [0, 1] with PDF:

P(x | \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)},

where:
x \in [0, 1],
\alpha > 0 and \beta > 0 are the shape parameters,
B(\alpha, \beta) is the Beta function:

B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} \, dt.



Beta Distribution II

Key Properties:
Mean: E[x] = \frac{\alpha}{\alpha + \beta}.
Variance: \mathrm{Var}(x) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.
Conjugate prior for the Bernoulli and Binomial distributions, i.e. if the prior is Beta and the likelihood is Bernoulli/Binomial, the posterior is also Beta.
Applications:
Modeling probabilities of binary events (e.g., success/failure, heads/tails).
Bayesian inference for proportions.



Dirichlet Distribution I

The Dirichlet distribution is a multivariate generalization of the Beta distribution with PDF:

P(x | \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} x_k^{\alpha_k - 1},

where:
x = (x_1, x_2, \ldots, x_K) is a probability vector (\sum_{k=1}^{K} x_k = 1); x lies on the (K - 1)-dimensional simplex,
\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K) are the shape parameters,



Dirichlet Distribution II

B(\alpha) is the multivariate Beta function:

B(\alpha) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left( \sum_{k=1}^{K} \alpha_k \right)},

where \Gamma(z) is the Gamma function, a generalization of the factorial function to real and complex numbers z (with \mathrm{Re}(z) > 0):

\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t} \, dt.



Dirichlet Distribution III

Key Properties:
Mean: E[x_k] = \frac{\alpha_k}{\sum_{i=1}^{K} \alpha_i}.
Variance: \mathrm{Var}(x_k) = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0 + 1)}, where \alpha_0 = \sum_{i=1}^{K} \alpha_i.
Mode: \mathrm{Mode}(x_k) = \frac{\alpha_k - 1}{\sum_{j=1}^{K} \alpha_j - K} for \alpha_k > 1.
Conjugate prior for the Multinomial distribution, i.e. if the prior is Dirichlet and the likelihood is Multinomial, the posterior is also Dirichlet.
Applications:
Modeling probabilities over multiple categories (e.g., dice rolls, topic modeling).
Bayesian inference for categorical data.



Relationship Between Beta and Dirichlet Distributions

The Beta distribution is a special case of the Dirichlet distribution when K = 2:

P(x_1, x_2 | \alpha_1, \alpha_2) = \frac{x_1^{\alpha_1 - 1} x_2^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)}.

Letting x_1 = x and x_2 = 1 - x, this reduces to the Beta distribution:

P(x | \alpha_1, \alpha_2) = \frac{x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)}.



Conclusion

Generative learning algorithms model p(x|y) and p(y).
GDA assumes a Gaussian distribution for p(x|y).
Naive Bayes assumes conditional independence of the features.
Laplace smoothing helps avoid zero probabilities.
Even though the Naive Bayes assumption is an extremely strong assumption, the NB algorithm works well on many problems.
It is simple, fast, and performs surprisingly well in practice, especially for text classification and other high-dimensional datasets.

