Generative Learning Algorithms
Mohamed Farah
ISAMM
2024-2025
3 Naive Bayes
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}.

y^{*} = \arg\max_{y} p(y \mid x) = \arg\max_{y} \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_{y} p(x \mid y)\, p(y),

since p(x) does not depend on y.
[Figure: contours of bivariate Gaussians with mean \mu = (0, 0)^{T} and covariances \Sigma = I, \Sigma = 0.6 I, \Sigma = 2 I (top row), and \Sigma = [[1, 0], [0, 1]], [[1, 0.5], [0.5, 1]], [[1, 0.8], [0.8, 1]] (bottom row).]
Log-likelihood:

\ell(D; \phi, \mu_0, \mu_1, \Sigma) = \sum_{i=1}^{n} \log p(x^{(i)} \mid y^{(i)}) + \sum_{i=1}^{n} \log p(y^{(i)}).
\mu_0 = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 0\}},
\qquad
\mu_1 = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}}.
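A minimal NumPy sketch of these estimates, assuming a data matrix X of shape (n, d) and a binary label vector y (names are illustrative); the pooled covariance at the end is the standard shared-Σ estimate, shown here for completeness:

import numpy as np

def gda_mle(X, y):
    """MLE for binary GDA with a shared covariance: phi, mu0, mu1, Sigma."""
    n, d = X.shape
    phi = np.mean(y == 1)                         # p(y = 1)
    mu0 = X[y == 0].mean(axis=0)                  # mean of the class-0 examples
    mu1 = X[y == 1].mean(axis=0)                  # mean of the class-1 examples
    # Pooled (shared) covariance: average outer product of the centered examples
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / n
    return phi, mu0, mu1, Sigma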
The contours of the two Gaussian distributions that have been fit to
the data in each of the two classes
Maximizing the Likelihood III
Note that the two Gaussians have contours that are the same shape
and orientation, since they share a covariance matrix Σ
They have different means µ0 and µ1 .
The straight line gives the decision boundary at which
p(y = 1|x) = 0.5. On one side of the boundary, we'll predict y = 1 to
be the most likely outcome, and on the other side, we'll predict y = 0.
For K classes, the model assumes y \sim \mathrm{Multinomial}(\phi_1, \phi_2, \ldots, \phi_K) and x \mid y = k \sim \mathcal{N}(\mu_k, \Sigma).
Training:
Estimate the class prior probabilities ϕk = p(y = k) for each class.
Estimate the mean vector µk for each class.
Estimate the shared covariance matrix Σ.
Prediction:
For a new input x, compute the posterior probability p(y = k|x) for
each class using Bayes’ rule:
p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{\sum_{j=1}^{K} p(x \mid y = j)\, p(y = j)}.
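A sketch of this prediction step, using scipy.stats.multivariate_normal for the class-conditional densities; mus, Sigma and phis stand for the fitted per-class means, shared covariance and class priors (illustrative names):

import numpy as np
from scipy.stats import multivariate_normal

def gda_posterior(x, mus, Sigma, phis):
    """p(y = k | x) for each class k via Bayes' rule with a shared covariance."""
    # Unnormalized posteriors: p(x | y = k) p(y = k)
    joint = np.array([multivariate_normal.pdf(x, mean=mu, cov=Sigma) * phi
                      for mu, phi in zip(mus, phis)])
    return joint / joint.sum()                    # normalize over the K classes

# The predicted class is the argmax of the posterior:
# y_hat = np.argmax(gda_posterior(x, mus, Sigma, phis))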
In the binary case (K = 2):

p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x)}. \qquad (4)
p(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^{T} x - \theta_0)},
where:

\theta = \Sigma^{-1}(\mu_1 - \mu_0),
\qquad
\theta_0 = -\frac{1}{2}\left(\mu_1^{T} \Sigma^{-1} \mu_1 - \mu_0^{T} \Sigma^{-1} \mu_0\right) + \log \frac{\phi}{1 - \phi}.
i.e., the expression simplifies to the logistic (sigmoid) function.
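A quick numerical check of this equivalence with arbitrary, purely illustrative parameter values; the posterior computed via Bayes' rule matches the logistic form with θ and θ0 as above:

import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary binary-GDA parameters (illustrative values only)
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
Sigma, phi = np.array([[1.0, 0.3], [0.3, 2.0]]), 0.4
x = np.array([0.5, -1.0])

# Posterior via Bayes' rule
p1 = multivariate_normal.pdf(x, mu1, Sigma) * phi
p0 = multivariate_normal.pdf(x, mu0, Sigma) * (1 - phi)
posterior = p1 / (p0 + p1)

# Posterior via the logistic form with theta and theta0 as above
Sinv = np.linalg.inv(Sigma)
theta = Sinv @ (mu1 - mu0)
theta0 = -0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0) + np.log(phi / (1 - phi))
logistic = 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

assert np.isclose(posterior, logistic)   # the two expressions agree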
Conclusion
The posterior probability p(y = 1|x) in GDA can be expressed as a
logistic function under the assumption that p(x|y ) is Gaussian
p(y = 1 \mid x; \phi, \Sigma, \mu_0, \mu_1) = \frac{1}{1 + \exp(-\theta^{T} x)},

where \theta is an appropriate function of \phi, \Sigma, \mu_0, \mu_1 (with the intercept \theta_0 absorbed by appending a constant feature x_0 = 1 to x).
The converse is not true; i.e., p(y |x) being a logistic function does
not imply p(x|y ) is multivariate Gaussian
Note
There are many different sets of assumptions that would lead to p(y |x)
taking the form of a logistic function.
For example, if x|y = 0 ∼ Poisson(λ0 ), and x|y = 1 ∼ Poisson(λ1 ), then
p(y |x) will be logistic.
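A small numerical illustration of the Poisson case (λ0, λ1 and the prior below are illustrative values): the log-odds are linear in x, so the posterior has the logistic form.

import numpy as np
from scipy.stats import poisson

lam0, lam1, phi = 2.0, 5.0, 0.5        # illustrative rates and class prior
xs = np.arange(0, 20)

# Posterior via Bayes' rule
p1 = poisson.pmf(xs, lam1) * phi
p0 = poisson.pmf(xs, lam0) * (1 - phi)
posterior = p1 / (p0 + p1)

# Logistic form: log-odds are linear in x for Poisson class-conditionals
theta = np.log(lam1 / lam0)
theta0 = -(lam1 - lam0) + np.log(phi / (1 - phi))
logistic = 1.0 / (1.0 + np.exp(-(theta * xs + theta0)))

assert np.allclose(posterior, logistic)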
The class-conditional distribution factorizes over the features, with each discrete feature modeled as

p(x \mid y) = \prod_{j=1}^{d} p(x_j \mid y), \qquad p(x_j = k \mid y) = \phi_{j,k|y}, \qquad \sum_{k=1}^{n_j} \phi_{j,k|y} = 1,

where:
x_j is a discrete feature representing counts,
\phi_{j,k|y} is the probability of feature x_j taking the value k given class y,
n_j is the number of possible values for feature x_j.
Probability model:
Given a training set \{(x^{(i)}, y^{(i)}); i = 1, \ldots, n\}, the joint likelihood of the data D is:

L(D; \phi_{y_k}, \phi_{j|y=k}) = \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}).
Bernoulli Model:
\phi_{j|y=k} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}},
\qquad
\phi_{y_1} = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}}{n}.
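A minimal NumPy sketch of these Bernoulli estimates, assuming a binary matrix X of shape (n, d) and labels y in {0, ..., K−1} (illustrative names):

import numpy as np

def bernoulli_nb_mle(X, y, K):
    """MLE for Bernoulli Naive Bayes: phi[k, j] = p(x_j = 1 | y = k), phi_y[k] = p(y = k)."""
    n, d = X.shape
    phi = np.zeros((K, d))
    phi_y = np.zeros(K)
    for k in range(K):
        mask = (y == k)
        phi[k] = X[mask].sum(axis=0) / mask.sum()   # fraction of class-k examples with x_j = 1
        phi_y[k] = mask.sum() / n                   # empirical class prior
    return phi, phi_y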
2 Likelihood Function (over the examples with y^{(i)} = k):

L(\phi_{j|y=k}) = \prod_{i:\, y^{(i)} = k} \phi_{j|y=k}^{x_j^{(i)}} (1 - \phi_{j|y=k})^{1 - x_j^{(i)}}.

3 Log-Likelihood Function:

\ell(\phi_{j|y=k}) = \sum_{i:\, y^{(i)} = k} \left[ x_j^{(i)} \log \phi_{j|y=k} + (1 - x_j^{(i)}) \log(1 - \phi_{j|y=k}) \right].
7 Cancel Terms:

\sum_{i:\, y^{(i)} = k} \left[ x_j^{(i)} - \phi_{j|y=k} \right] = 0.
Solving for \phi_{j|y=k}:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = k\}\, x_j^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}.
10 Final Result: Since x_j^{(i)} is binary, 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\} = 1\{y^{(i)} = k\}\, x_j^{(i)}. Thus:

\phi_{j|y=k} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}.
Intuition:
The numerator counts the number of times xj = 1 when y = k.
The denominator counts the total number of times y = k.
The ratio gives the empirical probability of xj = 1 given y = k, which is
the MLE for ϕj|y =k .
Naive Bayes Parameter Estimation VI
Multinomial Model:
\phi_{j|y=k} = \frac{\sum_{i=1}^{n} x_j^{(i)} \cdot 1\{y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\} \cdot \sum_{j'=1}^{d} x_{j'}^{(i)}},
\qquad
\phi_{y_k} = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}{n}.
Gaussian Model:
\mu_{j|y=k} = \frac{\sum_{i=1}^{n} x_j^{(i)} \cdot 1\{y^{(i)} = k\}}{\sum_{i=1}^{n} 1\{y^{(i)} = k\}}.
These estimates are derived based on the type of data and the chosen
event model.
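A sketch of the multinomial and Gaussian estimates above under the same assumptions (count-valued or real-valued X, labels y; names illustrative):

import numpy as np

def multinomial_nb_mle(X, y, K):
    """phi[k, j] = fraction of the total class-k counts that fall in feature j."""
    phi = np.zeros((K, X.shape[1]))
    for k in range(K):
        counts = X[y == k].sum(axis=0)        # total count of feature j in class k
        phi[k] = counts / counts.sum()        # normalize by the total count in class k
    return phi

def gaussian_nb_mean_mle(X, y, K):
    """mu[k, j] = mean of feature j over the class-k examples."""
    return np.stack([X[y == k].mean(axis=0) for k in range(K)])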
Inference using Naive Bayes Classifier
p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{p(x)}
= \frac{\left[\prod_{j=1}^{d} p(x_j \mid y = k)\right] p(y = k)}{\sum_{k'=1}^{K} \left[\prod_{j=1}^{d} p(x_j \mid y = k')\right] p(y = k')}.
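In practice the product over the d features is evaluated in log space to avoid numerical underflow. A sketch for the Bernoulli event model, assuming parameters phi (shape K × d, strictly inside (0, 1), e.g. after smoothing) and class priors phi_y as estimated earlier (illustrative names):

import numpy as np

def bernoulli_nb_predict_proba(x, phi, phi_y):
    """p(y = k | x) for a binary feature vector x, computed in log space."""
    # log p(y = k) + sum_j log p(x_j | y = k) for each class k
    log_joint = (np.log(phi_y)
                 + (x * np.log(phi) + (1 - x) * np.log(1 - phi)).sum(axis=1))
    log_joint -= log_joint.max()                  # stabilize before exponentiating
    probs = np.exp(log_joint)
    return probs / probs.sum()                    # normalize over the K classes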
Laplace Smoothing (Bernoulli Model):

\phi_y = \frac{1 + \sum_{i=1}^{n} 1\{y^{(i)} = 1\}}{2 + n}.
Multinomial Model:
\phi_{y_k} = \frac{1 + \sum_{i=1}^{n} 1\{y^{(i)} = k\}}{K + n}.
Bayesian Framework:
The MAP estimator maximizes the posterior distribution:
P(\phi_{j|y=k}) = \frac{\phi_{j|y=k}^{\alpha - 1}\, (1 - \phi_{j|y=k})^{\beta - 1}}{B(\alpha, \beta)}.
where:
α and β are the hyperparameters of the Beta prior.
B(·, ·) is the Beta function:
B(a, b) = \int_{0}^{1} t^{a - 1} (1 - t)^{b - 1}\, dt.
Posterior:
P(\phi_{j|y=k} \mid D) \propto \phi_{j|y=k}^{\sum_{i:\, y^{(i)} = k} x_j^{(i)} + \alpha - 1}\, (1 - \phi_{j|y=k})^{\sum_{i:\, y^{(i)} = k} (1 - x_j^{(i)}) + \beta - 1}.
MAP Estimate:
\phi_{j|y=k}^{\mathrm{MAP}} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\} + \alpha - 1}{\sum_{i=1}^{n} 1\{y^{(i)} = k\} + \alpha + \beta - 2}.
Intuition:
The MAP estimate combines the observed data (likelihood) with prior
knowledge (prior).
The hyperparameters α and β act as "pseudo-counts" to smooth the
estimate.
When α = β = 1, the Beta prior is uniform, and the MAP estimate
reduces to the MLE.
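A sketch of the MAP estimate for the Bernoulli event model under a Beta(α, β) prior, with the hyperparameters acting as pseudo-counts (same illustrative X, y, K as before):

import numpy as np

def bernoulli_nb_map(X, y, K, alpha=2.0, beta=2.0):
    """MAP estimate of p(x_j = 1 | y = k) under a Beta(alpha, beta) prior."""
    phi = np.zeros((K, X.shape[1]))
    for k in range(K):
        mask = (y == k)
        phi[k] = (X[mask].sum(axis=0) + alpha - 1) / (mask.sum() + alpha + beta - 2)
    return phi

# With alpha = beta = 1 (uniform prior) this is exactly the MLE.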
Bayesian Framework:
The EAP estimator computes the expected value of the posterior
distribution:
\theta_{\mathrm{EAP}} = E[\theta \mid D] = \int \theta \cdot P(\theta \mid D)\, d\theta,

with the same Beta prior as before:

P(\phi_{j|y=k}) = \frac{\phi_{j|y=k}^{\alpha - 1}\, (1 - \phi_{j|y=k})^{\beta - 1}}{B(\alpha, \beta)}.
i.e.,

P(\phi_{j|y=k} \mid D) = \frac{\phi_{j|y=k}^{\alpha + \sum_{i:\, y^{(i)} = k} x_j^{(i)} - 1}\, (1 - \phi_{j|y=k})^{\beta + \sum_{i:\, y^{(i)} = k} (1 - x_j^{(i)}) - 1}}{B\left(\alpha + \sum_{i:\, y^{(i)} = k} x_j^{(i)},\ \beta + \sum_{i:\, y^{(i)} = k} (1 - x_j^{(i)})\right)}.
\phi_{j|y=k}^{\mathrm{EAP}} = \frac{\alpha + \sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{\alpha + \beta + \sum_{i=1}^{n} 1\{y^{(i)} = k\}}.
Intuition:
The EAP estimate averages over the entire posterior distribution,
providing a more robust estimate than the MAP.
The hyperparameters α and β act as "pseudo-counts" to smooth the
estimate.
When α = β = 1, the Beta prior is uniform, and the EAP estimate
reduces to the Laplace-smoothed (add-one) estimate.
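And the corresponding EAP (posterior-mean) estimate; with α = β = 1 it gives exactly the add-one (Laplace) smoothed estimate (same illustrative setup):

import numpy as np

def bernoulli_nb_eap(X, y, K, alpha=1.0, beta=1.0):
    """EAP (posterior mean) estimate of p(x_j = 1 | y = k) under a Beta prior."""
    phi = np.zeros((K, X.shape[1]))
    for k in range(K):
        mask = (y == k)
        phi[k] = (X[mask].sum(axis=0) + alpha) / (mask.sum() + alpha + beta)
    return phi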
The Beta distribution has density

P(x \mid \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)},
where:
x ∈ [0, 1],
α > 0 and β > 0 are the shape parameters,
B(α, β) is the Beta function:
B(\alpha, \beta) = \int_{0}^{1} t^{\alpha - 1} (1 - t)^{\beta - 1}\, dt.
Key Properties:
Mean: E[x] = \frac{\alpha}{\alpha + \beta}.
Variance: Var(x) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.
Conjugate prior for the Bernoulli and Binomial distributions, i.e. if the
prior is Beta and the likelihood is Bernoulli/Binomial, the posterior is
also Beta.
Applications:
Modeling probabilities in binary events (e.g., success/failure,
heads/tails).
Bayesian inference for proportions.
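A small check of the conjugacy property: starting from a Beta prior and observing Bernoulli data, the posterior is again Beta with updated pseudo-counts (hyperparameters and data below are illustrative; scipy is used only to evaluate the posterior mean):

import numpy as np
from scipy.stats import beta

alpha0, beta0 = 2.0, 3.0                      # illustrative prior hyperparameters
data = np.array([1, 0, 1, 1, 0, 1])           # illustrative Bernoulli observations
alpha_post = alpha0 + data.sum()              # alpha + number of ones
beta_post = beta0 + (1 - data).sum()          # beta + number of zeros

# Posterior mean (EAP) and mode (MAP) from the Beta properties above
print(alpha_post / (alpha_post + beta_post))              # E[x | D]
print((alpha_post - 1) / (alpha_post + beta_post - 2))    # MAP estimate
print(beta.mean(alpha_post, beta_post))                   # matches the closed-form mean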
The Dirichlet distribution has density

P(x \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} x_k^{\alpha_k - 1},
where:
x = (x_1, x_2, \ldots, x_K) is a probability vector (\sum_{k=1}^{K} x_k = 1); x lies on the (K - 1)-dimensional simplex,
\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K) are the shape parameters,
B(\alpha) is the multivariate Beta function, B(\alpha) = \prod_{k=1}^{K} \Gamma(\alpha_k) \big/ \Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right).
Key Properties:
Mean: E[x_k] = \frac{\alpha_k}{\sum_{i=1}^{K} \alpha_i}.
Variance: Var(x_k) = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0 + 1)}, where \alpha_0 = \sum_{i=1}^{K} \alpha_i.
Mode: \mathrm{Mode}(x_i) = \frac{\alpha_i - 1}{\sum_{j=1}^{K} \alpha_j - K} for \alpha_i > 1.
For K = 2, the density is

P(x_1, x_2 \mid \alpha_1, \alpha_2) = \frac{x_1^{\alpha_1 - 1}\, x_2^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)}.
Letting x1 = x and x2 = 1 − x, this reduces to the Beta distribution:
P(x \mid \alpha_1, \alpha_2) = \frac{x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)}.
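A quick numerical check that the two-component Dirichlet density coincides with the Beta density at an (illustrative) point of the simplex:

import numpy as np
from scipy.stats import dirichlet, beta

a1, a2, x = 2.5, 4.0, 0.3                       # illustrative shape parameters and point
dir_pdf = dirichlet.pdf([x, 1 - x], [a1, a2])   # Dirichlet density on the 1-simplex
beta_pdf = beta.pdf(x, a1, a2)                  # Beta density at the same point
assert np.isclose(dir_pdf, beta_pdf)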