NTU EE6483 Week12_PCA_BayesianInference_before_lecture

Artificial Intelligence & Data Mining

Week 12

WEN Bihan (Asst Prof)


Homepage: https://personal.ntu.edu.sg/bihan.wen/

1
Recap: Bias and Variance

• Model Complexity Analysis:

Underfitting: high bias and low variance.
Overfitting: low bias and high variance.

2
Recap: Overfitting and Underfitting

• Simple Model

• High Bias

• Causes an algorithm to miss relevant relations between the input features and the target outputs.

• Complex Model

• High Variance

• Causes an algorithm to model the noise in the training set.

3
Recap: Overfitting and Underfitting

1. High model complexity -> large gap between training and validation error -> overfitting.

2. Low model complexity -> small gap between training and validation error -> underfitting.

4
Overfitting and Underfitting - Examples

• Suppose that you are training a NN for classification. When you reduce the model complexity by removing a layer, your validation accuracy increases.

• How do the model bias and variance change?

• Does your model suffer from overfitting or underfitting?

• Now, if you make your NN even deeper (higher complexity):

• Will your training accuracy increase?

• Will your validation accuracy increase?

5
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

2. Increase the training data complexity / size, to reduce the variance.

3. Simplify data distribution and dimensionality.

• Dimensionality Reduction

(Figure: the same data represented with k = 200, k = 50, and k = 2 dimensions.)
6
Dimensionality Reduction

WEN Bihan (Asst Prof)


Homepage: https://personal.ntu.edu.sg/bihan.wen/

7
Outline

• Concept of Dimensionality Reduction

• Principal Component Analysis (PCA)

• How to derive PCA (not examinable)

• Examples

8
Carry-on Questions

• Why do we need dimensionality reduction?

• What are two objectives achieved by PCA?

• How to calculate PCA (not examinable)?

9
Unsupervised Learning

• Clustering

• Exploit similar structures amongst the data themselves.


• One way to summarize various data points with a single categorical variable,
i.e., the cluster centroid.

10
Unsupervised Learning

• Clustering

• Exploit similar structures amongst the data themselves.


• One way to summarize various data points with a single categorical variable,
i.e., the cluster centroid.

• Dimensionality Reduction

• Another way to simplify the complex and high-dimensional data.


• Summarize data with a lower dimensional real valued vector.

11
Dimensionality Reduction

• Goal: find the best-fitting low-dimensional subspace to represent {x_i}_{i=1}^N.

• Fit 2-d data with a 1-d line.

• Fit 3-d data with a 2-d plane.

12
Dimensionality Reduction

• Goal: find the best-fitting low-dimensional subspace to represent {x_i}_{i=1}^N.

• Given data points in d dimensions

• Convert them to data points in r<d dimensions

• With minimal loss of information

13
Example: Reduce Data from 2D to 1D
(Figure: 2-D data points, with one axis in inches and the other in cm.)

14
Example: Reduce Data from 2D to 1D

Reduce data from 2D to 1D.

(Figure: the 2-D data and a 1-D line fitted to them.)
15
Example: Reduce Data from 2D to 1D

Reduce data from 2D to 1D.

(Figure: the 2-D data projected onto the 1-D line.)
16
Example: Reduce Data from 3D to 2D

3D data points (x1, x2, x3): approximated by a 2-D plane.

2D data points (z1, z2): the projections onto that plane.

(A small projection sketch is given below.)
17
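To make the 3D-to-2D projection concrete, here is a minimal sketch with assumed points and an assumed orthonormal basis (neither taken from the slides): each 3-D point (x1, x2, x3) is mapped to 2-D coordinates (z1, z2) on a plane. Choosing the best such plane is exactly what PCA, introduced later in this deck, does.

```python
# Minimal sketch (assumed data and basis): project 3-D points onto a 2-D plane
# spanned by two orthonormal vectors, giving 2-D coordinates (z1, z2).
import numpy as np

X = np.array([[1.0, 2.0, 0.1],     # each row is one 3-D point (x1, x2, x3)
              [2.0, 0.5, -0.2],
              [0.5, 1.0, 0.05]])

# Hypothetical orthonormal basis of the plane (any V2 with V2.T @ V2 = I works).
V2 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 0.0]])

Z = X @ V2            # 2-D coordinates of each point on the plane
X_approx = Z @ V2.T   # the 3-D approximation lying on the plane
print(Z)
print(X_approx)
```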
Why Dimensionality Reduction?

• What is wrong with high dimensions?

• High Dimensions = Lots of Features

• Complex system to process

• Inefficient algorithm

• Overfitting to noise or other data corruptions.

18
Dimensionality Reduction

• Why do we need dimensionality reduction?

• Remove Feature Redundancy + Noise

• Prevent Overfitting

• Reduced Model Complexity

• Simplified Data distribution

• Better Visualization

19
Principal Component Analysis

• Principal Component Analysis (PCA)

• Simple and popular method.

• Unsupervised learning.

• To learn the “best” low-dimensional subspace for


data projection.

• What does the “best” mean here?

20
Principal Component Analysis

• Two ways to interpret the “Best”:


1. The mean square error (MSE) of the projected data is
minimized.

2. The variance of the projected data is maximized.

• These two objectives are achieved simultaneously by


applying PCA for dimensionality reduction.

• Why and How?

21
Principal Component Analysis

1. The mean square error (MSE) of the projected data


is minimized.

• x_i (black): the original data.

• v (red): the PCA subspace.

• (v^T x_i) v (blue): the projected data.

• Minimize the MSE (green), i.e., the perpendicular offset, over all {x_i}.
22
Principal Component Analysis

2. The variance of the projected data is maximized.

• (v^T x_i) v (blue): the projected data.

• Variance = how spread out the blue projections are.

• PCA chooses v so that this blue variance is maximized.

23
Principal Component Analysis

• Minimizing MSE <=> Maximizing Projected Variance

• By the Pythagorean theorem, blue^2 + green^2 = black^2 for each data point.

• Black is fixed (given the data), so

• maximizing blue (the projected variance) is equivalent to minimizing green (the MSE); a numerical check follows below.

24
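Below is a minimal numerical check of this equivalence, using an assumed random point and direction: for a unit vector v, ||x||^2 = (v^T x)^2 + ||x - (v^T x) v||^2, i.e., black^2 = blue^2 + green^2.

```python
# Minimal check (assumed data): black^2 = blue^2 + green^2 for a unit vector v,
# so maximizing the blue (projected) part minimizes the green (residual) part.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # an arbitrary data point ("black")
v = rng.normal(size=3)
v /= np.linalg.norm(v)                 # unit direction of the subspace ("red")

blue2 = (v @ x) ** 2                            # squared length of the projection
green2 = np.linalg.norm(x - (v @ x) * v) ** 2   # squared perpendicular offset
black2 = np.linalg.norm(x) ** 2

print(np.isclose(blue2 + green2, black2))       # True
```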
Example: PCA

• Which of the following 𝒗 generates higher


projected variance?

25
Example: PCA

• Which of the following v generates higher projected variance?

• Find the projected mean u.

• Calculate the distance d_i between each projected point and the mean u.

• The variance of the projected data (see the computation sketch below):

  \mathrm{var} = \frac{1}{N} \sum_{i=1}^{N} d_i^2

26
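A small computation sketch of this recipe, on assumed 2-D points: project onto a candidate unit direction v, subtract the projected mean u, and average the squared distances d_i.

```python
# Minimal sketch (assumed data): projected variance along two candidate directions.
import numpy as np

X = np.array([[2.0, 1.9], [1.0, 1.1], [-1.0, -0.9], [-2.0, -2.1]])  # assumed points

def projected_variance(X, v):
    v = v / np.linalg.norm(v)      # make v a unit vector
    z = X @ v                      # scalar projections v^T x_i
    d = z - z.mean()               # distances d_i to the projected mean u
    return np.mean(d ** 2)         # var = (1/N) * sum_i d_i^2

v_a = np.array([1.0, 1.0])         # roughly along the data spread
v_b = np.array([1.0, -1.0])        # roughly perpendicular to the data spread
print(projected_variance(X, v_a))  # larger
print(projected_variance(X, v_b))  # smaller
```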
Example: PCA

• The subspace v on the left generates higher projected variance.

(Figure: the left subspace gives larger projected variance; the right one gives smaller projected variance.)
27
Example: PCA

• Which of the following 𝒗 generates smaller


projected MSE?

28
Example: PCA

• The subspace v on the left generates smaller projected MSE, i.e., smaller perpendicular offsets.

(Figure: the left subspace gives smaller projected MSE; the right one gives larger projected MSE.)

29
Derivation of PCA - Optional

1. The mean square error (MSE) of the projected data is minimized.

• For all {x_i}_{i=1}^N, the green^2 (on average) is minimized.

• Find the best red v (a unit vector) that minimizes green^2.

• The principal component:

  z_i = (v^T x_i) v

30
Derivation of PCA - Optional

1. The mean square error (MSE) of the projected data is minimized.

• The principal component:

  z_i = (v^T x_i) v

• The MSE to be minimized:

  \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \| z_i - x_i \|^2

31
Derivation of PCA - Optional

2. The variance of the projected data is maximized.

• Variance of the projected data:

  \mathrm{Var} = \frac{1}{N} \sum_{i=1}^{N} \| z_i - \bar{z} \|^2

• \bar{z} denotes the mean of {z_i}.

32
Derivation of PCA - Optional

2. The variance of the projected data is maximized.

• For zero-mean data {x_i}_{i=1}^N, we have \bar{z} = 0, so

  \mathrm{Var} = \frac{1}{N} \sum_{i=1}^{N} \| z_i \|^2 = \frac{1}{N} \sum_{i=1}^{N} (v^T x_i)^2

33
Derivation of PCA - Optional

• Mathematical Problem Formulation:

• PCA is to maximize the projected data variance.

• Consider a set of zero-mean {x_i}_{i=1}^N:

  \mathrm{Var} = \frac{1}{N} \sum_{i=1}^{N} \| z_i \|^2 = \frac{1}{N} \sum_{i=1}^{N} (v^T x_i)^2

• Note that this variance is different from the variance & bias in the overfitting analysis.

34
Derivation of PCA - Optional

• Mathematical Problem Formulation:

• PCA is to maximize the projected data variance.

• Consider a set of zero-mean {x_i}_{i=1}^N; the first unit weight vector v satisfies:

  \max_{\|v\|_2 = 1} \frac{1}{N} \sum_{i=1}^{N} (v^T x_i)^2

35
Derivation of PCA - Optional

• Recap: represent a set of data vectors {x_i}_{i=1}^N as a data matrix X of size N x d, whose i-th row is x_i^T:

  X = \begin{bmatrix} x_1^T \\ \vdots \\ x_i^T \\ \vdots \\ x_N^T \end{bmatrix}

  (N: number of data points, one per row; d: feature dimension.)
36
Derivation of PCA - Optional

• Mathematical Problem Formulation:

• PCA is to maximize the projected data variance.

• Consider a set of zero-mean {x_i}_{i=1}^N; the first unit weight vector v satisfies:

  \max_{\|v\|_2 = 1} \frac{1}{N} \sum_{i=1}^{N} (v^T x_i)^2 = \max_{\|v\|_2 = 1} \frac{1}{N} v^T X^T X v

37
Derivation of PCA - Optional

• Mathematical Problem Formulation:

• PCA is to maximize the projected data variance.

• Consider a set of zero-mean {x_i}_{i=1}^N; the first unit weight vector v satisfies:

  \max_{\|v\|_2 = 1} \frac{1}{N} \sum_{i=1}^{N} (v^T x_i)^2 = \max_{\|v\|_2 = 1} \frac{1}{N} v^T X^T X v

• Here \frac{1}{N} \sum_{i=1}^{N} (v^T x_i)^2 = \frac{1}{N} v^T X^T X v is the variance of the projected data (a numerical check follows below).

• The optimal v* is the first unit weight vector for the PCA.

• The first Principal Component: z_i = (v*^T x_i) v*

38
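The numerical check below, on assumed zero-mean data, confirms that the summation form and the matrix form of the projected variance agree.

```python
# Minimal check (assumed zero-mean data):
# (1/N) * sum_i (v^T x_i)^2  equals  (1/N) * v^T X^T X v, with rows of X being x_i^T.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)                    # make the data zero-mean
v = rng.normal(size=3)
v /= np.linalg.norm(v)                 # unit weight vector

N = X.shape[0]
var_sum = np.mean((X @ v) ** 2)        # (1/N) sum_i (v^T x_i)^2
var_mat = (v @ X.T @ X @ v) / N        # (1/N) v^T X^T X v
print(np.isclose(var_sum, var_mat))    # True
```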
Derivation of PCA - Optional

• Mathematical Problem Formulation:

• The optimal v* is the first unit weight vector for PCA:

  \max_{\|v\|_2 = 1} \frac{1}{N} \sum_{i=1}^{N} (v^T x_i)^2 = \max_{\|v\|_2 = 1} \frac{1}{N} v^T X^T X v

• You can calculate the other v and z_i in an incremental fashion (see the sketch below):

• Substitute x_i with the residual (x_i - z_i).

• Calculate the next unit weight vector and principal component.

• How to obtain the optimal v here?

39
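Here is a minimal sketch of this incremental ("deflation") idea, assuming zero-mean data: the best direction is the top eigenvector of X^T X (equivalently, the first right singular vector of X); replacing each x_i by its residual and repeating yields the next unit weight vector.

```python
# Minimal sketch (assumed data): find the first direction, remove each point's
# component along it, and repeat on the residuals to get the next direction.
import numpy as np

def first_direction(X):
    # Direction maximizing (1/N) v^T X^T X v: the top eigenvector of X^T X.
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    return eigvecs[:, -1]              # eigenvector of the largest eigenvalue

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
X -= X.mean(axis=0)                    # zero-mean data

v1 = first_direction(X)
Z1 = np.outer(X @ v1, v1)              # first principal components z_i = (v1^T x_i) v1
R = X - Z1                             # residuals x_i - z_i
v2 = first_direction(R)                # next unit weight vector
print(abs(v1 @ v2))                    # ~0: the two directions are orthogonal
```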
Derivation of PCA - Optional

• Algorithm:

• We need to use Singular Value Decomposition (SVD):

40
Derivation of PCA - Optional

• Algorithm:

1. Apply SVD to the data matrix X:  \mathrm{svd}(X) = U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T.

2. V = [v_1, v_2, ..., v_k, ..., v_d], where v_k is the k-th unit weight vector for PCA.

3. The k-th Principal Component of x_i: (v_k^T x_i) v_k.

4. Project onto the k-dimensional subspace V_k = [v_1, v_2, ..., v_k]:

  V_k^T x_i = \begin{bmatrix} v_1^T x_i \\ \vdots \\ v_k^T x_i \end{bmatrix}

(A code sketch of this algorithm follows below.)

41
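A minimal sketch of this algorithm with NumPy's SVD; the function name pca_svd and the random data are illustrative assumptions, not part of the slides.

```python
# Minimal sketch: PCA via SVD of the (zero-mean) data matrix X.
# The columns of V are the unit weight vectors v_k; projecting onto the first k
# of them gives the k-dimensional representation V_k^T x_i.
import numpy as np

def pca_svd(X, k):
    """X: (N, d) zero-mean data matrix. Returns (N, k) projections and V_k of shape (d, k)."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(S) V^T
    Vk = Vt[:k].T                                      # first k unit weight vectors
    Z = X @ Vk                                         # rows are V_k^T x_i
    return Z, Vk

# Usage on assumed data:
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)
Z, Vk = pca_svd(X, k=2)
X_approx = Z @ Vk.T             # reconstruction from the first 2 principal components
print(Z.shape, X_approx.shape)  # (100, 2) (100, 5)
```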
Derivation of PCA - Example

• Given the data points x_1 = (4, 6, 10), x_2 = (3, 10, 13), x_3 = (-2, -6, -8).

• Calculate the first unit weight vector v and the first Principal Component.

1. Data Matrix:  X = \begin{bmatrix} 4 & 6 & 10 \\ 3 & 10 & 13 \\ -2 & -6 & -8 \end{bmatrix}

2. SVD:  X = U \Sigma V^T

3. V = \begin{bmatrix} -0.22 & 0.78 & -0.58 \\ -0.57 & -0.59 & -0.58 \\ -0.79 & 0.20 & 0.58 \end{bmatrix},  so  v_1 = \begin{bmatrix} -0.22 \\ -0.57 \\ -0.79 \end{bmatrix}  and  z_{i,k} = (v_k^T x_i) v_k.

(This example is checked numerically below.)
42
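The worked example above can be verified as follows (note that SVD determines each v_k only up to sign):

```python
# Checking the worked example: the slide applies SVD to X as given.
import numpy as np

X = np.array([[4.0, 6.0, 10.0],
              [3.0, 10.0, 13.0],
              [-2.0, -6.0, -8.0]])

U, S, Vt = np.linalg.svd(X)
v1 = Vt[0]                       # first unit weight vector (sign may be flipped)
print(np.round(v1, 2))           # approx. +/-[0.22, 0.57, 0.79]
z1 = (v1 @ X[0]) * v1            # first Principal Component of x_1 = (v1^T x_1) v1
print(np.round(z1, 2))           # the same regardless of the sign of v1
```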
PCA Example 1

• Representation of handwritten digits in the MNIST dataset.

(Figure: digit samples, showing the sample variance due to rotations, scales, etc.)

43
PCA Example 1

• Representation of handwritten digits in the MNIST dataset.

• PCA provides more robust and invariant features.

(Figure: reconstructions of a digit x using k = 2, 10, 50, and 200 principal components.)

44
PCA Example 2
• Image Compression

(Figure: the original image.)

• Divide the original 372x492 image into non-overlapping patches.

• Each patch is an instance that contains 12x12 pixels on a grid.

• View each patch as a 144-D vector (a minimal sketch follows below).
45
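A minimal sketch of this patch-based pipeline is given below, using a random array as a stand-in for the image; the reshape-based patch extraction and the choice k = 16 are illustrative assumptions.

```python
# Minimal sketch: cut a grayscale image into non-overlapping 12x12 patches,
# view each patch as a 144-D vector, and compress with the first k principal directions.
import numpy as np

img = np.random.rand(372, 492)                 # stand-in for the original image
p = 12
patches = (img[:372 - 372 % p, :492 - 492 % p]
           .reshape(372 // p, p, 492 // p, p)
           .swapaxes(1, 2)
           .reshape(-1, p * p))                # (num_patches, 144) data matrix

mean = patches.mean(axis=0)
Xc = patches - mean                            # zero-mean the 144-D vectors
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 16                                         # e.g. 144-D -> 16-D
Z = Xc @ Vt[:k].T                              # compressed k-D codes
recon = Z @ Vt[:k] + mean                      # reconstructed 144-D patches
print(patches.shape, Z.shape, recon.shape)
```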
PCA Example 2
• Image Compression
PCA compression: 144D → 60D

46
PCA Example 2
• Image Compression
PCA compression: 144D → 16D

47
PCA Example 2
• Image Compression

(Figure: the 16 most important eigenvectors, each displayed as a 12x12 patch, arranged in a 4x4 grid.)

48
PCA Example 2
• Image Compression
PCA compression: 144D → 6D

49
PCA Example 2
• Image Compression
PCA compression: 144D → 1D

50
Carry-on Questions

• Why do we need dimensionality reduction?

• To remove redundancy, prevent overfitting, remove noise, etc.

• What are two objectives achieved by PCA?

• Minimizing MSE and Maximizing Projected Variance

• How to calculate PCA?

• Apply SVD to the data matrix. Select the unit weight vectors v_k from the V matrix, and calculate (v_k^T x_i) v_k.
51
Bayesian Inference

WEN Bihan (Asst Prof)


Homepage: https://personal.ntu.edu.sg/bihan.wen/

52
Outline

• Probability and Conditional Probability

• Bayes’ Theorem

• Naïve Bayes

• Examples

53
Carry-on Questions

• What is Bayes’ Theorem?

• What is Naïve Bayes assumption?

54
From deterministic to probabilistic learning

• Training a deep neural network for classification:

• Deterministic model: always gives you the same output for the same input.

55
From deterministic to probabilistic learning

• Training a deep neural network for classification:

• Deterministic inference: gives you the same output for the same input.

• Bayesian Inference

• Tossing a coin twice, what outcomes will we get?


• There are 4 possible outcomes of this experiment

𝑆 = {𝐻𝐻, 𝐻𝑇, 𝑇𝐻, 𝑇𝑇}

• Probabilistic model: 𝑃𝑟𝑜𝑏 𝐻𝐻 = 0.25

56
Basic Concepts in Probability Theory

• Concepts and Notations:

• h = A hypothesis or an event. (Example: tossing a coin.)

57
Basic Concepts in Probability Theory

• Concepts and Notations:

• h = A hypothesis or an event.

• D = A collection of data (e.g., training data).

Example: we tossed the coin 8 times and sequentially got 4 heads and 4 tails: {H,H,H,H,T,T,T,T}.
58
Basic Concepts in Probability Theory

• Concepts and Notations:

• h = A hypothesis or an event.

• D = A collection of data (e.g., training data).

• P(h) = Probability that Hypothesis h holds.

• P(D) = Probability of observing the training data D.

P(h) = probability of getting heads

59
Basic Concepts in Probability Theory

• Concepts and Notations:

• h = A hypothesis or an event.

• D = A collection of data (e.g., training data).

• P(h) = Probability that Hypothesis h holds.

• P(D) = Probability of observing the training data D.

• P(D|h) = Probability of observing D when h holds. (read


“probability of D given h”, “probability of D conditioned on h” ).

60
Basic Concepts in Probability Theory

• We are interested in P(h|D), because:

• We always learn from history / prior knowledge.

• We normally have access to the training data D.

• We need P(h|D) because the aim is to find the most probable hypothesis given that data.

61
Basic Concepts in Probability Theory

• Concepts and Notations:

• h = A hypothesis or an event.

• D = A collection of data (e.g., training data).

• P(h) = Probability that Hypothesis h holds.

• P(D) = Probability of observing the training data D.

• P(D|h) = Probability of observing D when h holds (read "probability of D given h").

• Can we calculate P(h|D)?

62


Joint and Conditional Probability

• Conditional Probability P(A|B): the probability that A happens given that B happened.

• Joint Probability P(A, B): the probability that A and B both happen.

• P(A, B) = P(A|B) * P(B)

63
Bayes’ Theorem

• The simple form of Bayes' Theorem:

  P(A|B) = \frac{P(B|A) P(A)}{P(B)}

• How to derive the theorem?

• Use the joint probability: P(A, B) = P(A|B) P(B) = P(B|A) P(A).

• Therefore:

  P(A|B) = \frac{P(A, B)}{P(B)} = \frac{P(B|A) P(A)}{P(B)}

64
Bayes’ Theorem

• Given that P(B) is a constant, the proportional form:

  P(A|B) \propto P(B|A) P(A)

• Sometimes P(B) can also be calculated as

  P(B) = \sum_i P(B|A_i) P(A_i)

• This is based on the Law of Total Probability, where {A_i} is a partition of all possible events.

65
Bayes’ Theorem

• Now, we can predict P(h|D):

  P(h|D) = \frac{P(D|h) P(h)}{P(D)}

• Some terminologies we are going to use:

• Prior probability P(h): prior knowledge of h before observing D.

• Posterior P(h|D): the probability of h after we have observed D.

• Likelihood P(D|h): the likelihood of observing D given h.

66
Bayes’ Theorem

• Maximum A Posteriori (MAP) hypothesis:

  h_{MAP} = \arg\max_h P(h|D)

• We just learned that P(h|D) = \frac{P(D|h) P(h)}{P(D)}. Given D, P(D) is a constant, so

  h_{MAP} = \arg\max_h P(D|h) P(h)

• If P(h) is constant (a uniform prior), MAP is equivalent to Maximum Likelihood (ML):

  h_{ML} = \arg\max_h P(D|h)

(A small illustrative sketch follows below.)

67
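A small illustrative sketch of MAP vs. ML, reusing the earlier coin data D = {H,H,H,H,T,T,T,T}; the two candidate hypotheses and the prior are assumptions made up for this example.

```python
# Minimal sketch (assumed hypotheses and prior): MAP vs. ML for coin-toss data.
heads, tails = 4, 4                                   # D = {H,H,H,H,T,T,T,T}

hypotheses = {"fair coin": 0.5, "biased coin": 0.7}   # P(heads) under each h
prior = {"fair coin": 0.2, "biased coin": 0.8}        # assumed prior P(h)

likelihood = {h: p**heads * (1 - p)**tails for h, p in hypotheses.items()}

h_ml = max(likelihood, key=likelihood.get)                       # argmax_h P(D|h)
h_map = max(likelihood, key=lambda h: likelihood[h] * prior[h])  # argmax_h P(D|h)P(h)
print(h_ml)    # "fair coin"
print(h_map)   # "biased coin" here; with a uniform prior, MAP reduces to ML
```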
Example Questions

Quiz 1: Flipping a coin

• Suppose we are flipping a fair coin twice.


What is the probability that both flips are heads?

68
Example Questions

• Quiz 2: Flipping a coin

• Suppose we are flipping a fair coin twice.


Given that the outcome of first flip is heads, what is the
probability that both flips are heads?

69
Example Questions

• Quiz 3: Our IE4483 is attended by students from both EEE and IEM.
Only 50% of the IEM students and 30% of the EEE students pass the
exam. Given that 60% of the entire class are EEE students, what is the
percentage of IEM students amongst those who pass the exam?

70
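One way to work Quiz 3 out (note: this sketch reveals the answer) is to combine Bayes' theorem with the law of total probability, using the numbers stated in the quiz:

```python
# Quiz 3 via Bayes' theorem and the law of total probability.
p_pass_given_iem, p_pass_given_eee = 0.5, 0.3
p_eee = 0.6
p_iem = 1 - p_eee                                              # 0.4

# Law of total probability: P(pass) = P(pass|IEM)P(IEM) + P(pass|EEE)P(EEE)
p_pass = p_pass_given_iem * p_iem + p_pass_given_eee * p_eee   # 0.38

# Bayes' theorem: P(IEM|pass) = P(pass|IEM)P(IEM) / P(pass)
p_iem_given_pass = p_pass_given_iem * p_iem / p_pass
print(round(p_iem_given_pass, 3))                              # ~0.526
```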
Example Questions

• Quiz 4: Monty Hall Problem

• You’re a contestant on a game show. You see three closed doors, and
behind one of them is a prize. You choose one door, and the host opens
one of the other doors and reveals that there is no prize behind it. Then
he offers you a chance to switch to the remaining door. Should you take
it?

http://en.wikipedia.org/wiki/Monty_Hall_problem

71
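A quick simulation sketch (not from the slides) estimates the win rates of the two strategies empirically:

```python
# Monty Hall simulation: estimate the win rate of "always switch" vs. "always stay".
import random

def play(switch: bool, trials: int = 100_000) -> float:
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        choice = random.randrange(3)
        # The host opens a door that is neither the contestant's pick nor the prize.
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print(play(switch=False))   # ~1/3
print(play(switch=True))    # ~2/3, so switching is the better strategy
```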
Naïve Bayes

• The basic form of Bayes' Theorem:

  P(A|B) = \frac{P(B|A) P(A)}{P(B)}

• It is straightforward if A and B are both single attributes.

• But what if we have multiple conditions or query events?

• How to represent their conditional probabilities?

72
Naïve Bayes

• Extend from simple example to large training datasets.

• Single attribute -> Multiple attributes

• How does P(a_1, a_2 | v_j) relate to P(a_1 | v_j) and P(a_2 | v_j)?

• Naïve Bayes Assumption:

  P(a_1, a_2 | v_j) = P(a_1 | v_j) P(a_2 | v_j)

73
Naïve Bayes

• Naïve Bayes assumption:

• The conditional independence assumption:

• The feature values are conditionally independent of one another, given the class label.

• Mathematical Form:

  P(a_1, ..., a_d | v_j) = P(a_1 | v_j) \cdots P(a_d | v_j)

  P(a_1, ..., a_d | v_j) = \prod_{i=1}^{d} P(a_i | v_j)

74
Naïve Bayes - Example

• Play Tennis:

• New Observation: <Sunny,Cool,High,Strong>, do you play tennis?


75
Naïve Bayes - Example

• Play Tennis:

• Compare P(yes | <S,C,H,S>) and P(no | <S,C,H,S>).

• According to Bayes' Theorem, this amounts to comparing P(yes, <S,C,H,S>) and P(no, <S,C,H,S>).

• P(yes, <S,C,H,S>) = P(yes) P(<S,C,H,S> | yes)

  (S = Sunny (Outlook), C = Cool (Temp.), H = High (Humidity), S = Strong (Wind))

76
Bayes’ Theorem

• From conditional probability to joint probability:

  P(A, B) = P(A|B) * P(B)

  P(yes, <S,C,H,S>) = P(yes | <S,C,H,S>) * P(<S,C,H,S>)

  P(no, <S,C,H,S>) = P(no | <S,C,H,S>) * P(<S,C,H,S>)

• P(<S,C,H,S>) is a constant, i.e., the same for both choices.

• Let us denote P(<S,C,H,S>) as P(Obser.).

77
Bayes’ Theorem

• Bayes' Theorem:

  P(A|B) = \frac{P(B|A) P(A)}{P(B)}

  P(yes | Obser.) = \frac{P(Obser. | yes) P(yes)}{P(Obser.)}

• Again, P(Obser.) is a constant, i.e., the same for both choices.

• So we just compare P(Obser. | yes) P(yes) and P(Obser. | no) P(no).

78
Naïve Bayes - Example

• Play Tennis:

• Compare P(yes | <S,C,H,S>) and P(no | <S,C,H,S>), i.e., compare P(yes, <S,C,H,S>) and P(no, <S,C,H,S>).

• P(yes, <S,C,H,S>) = P(yes) P(<S,C,H,S> | yes)

• Based on the Naïve Bayes assumption, we have

  P(<S,C,H,S> | yes) = P(S|yes) P(C|yes) P(H|yes) P(S|yes)
                     = (2/9) x (3/9) x (3/9) x (3/9) = 0.00823

  (S = Sunny (Outlook), C = Cool (Temp.), H = High (Humidity), S = Strong (Wind); all probabilities are obtained by simply counting in the training table.)

• P(yes) = #yes / #total = 9/14

• Thus, P(yes, <S,C,H,S>) = (9/14) x 0.00823 ≈ 0.0053
79
Naïve Bayes - Example

• Play Tennis:

• P(yes, <S,C,H,S>) ≈ 0.0053

• Similarly, P(no, <S,C,H,S>) = P(no) P(<S,C,H,S> | no)
                              = P(no) P(S|no) P(C|no) P(H|no) P(S|no)
                              = (5/14) x 0.6 x 0.2 x 0.8 x 0.6 ≈ 0.0206

• Do you play tennis?

• P(yes, <S,C,H,S>) < P(no, <S,C,H,S>)

• Given the observation <Sunny, Cool, High, Strong>, it is more likely that you will NOT play tennis (a short sketch of this computation follows below).

80
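The sketch below reproduces the comparison above; the class priors and conditional probabilities are exactly the values quoted on these slides (the underlying PlayTennis training table itself is not reproduced here).

```python
# Minimal Naive Bayes sketch for the observation <Sunny, Cool, High, Strong>,
# using the probabilities quoted on the slides.
p_yes, p_no = 9 / 14, 5 / 14                    # class priors from counting

cond_yes = [2 / 9, 3 / 9, 3 / 9, 3 / 9]         # P(Sunny|yes), P(Cool|yes), P(High|yes), P(Strong|yes)
cond_no = [0.6, 0.2, 0.8, 0.6]                  # P(Sunny|no),  P(Cool|no),  P(High|no),  P(Strong|no)

score_yes = p_yes
for p in cond_yes:
    score_yes *= p                              # Naive Bayes: multiply the P(a_i | yes)

score_no = p_no
for p in cond_no:
    score_no *= p                               # same for the "no" class

print(round(score_yes, 4), round(score_no, 4))  # ~0.0053 vs ~0.0206 -> predict "no"
```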
Carry-on Questions

• What is Bayes' Theorem?

  P(A|B) = \frac{P(B|A) P(A)}{P(B)}

• What is the Naïve Bayes assumption?

  P(a_1, ..., a_d | v_j) = \prod_{i=1}^{d} P(a_i | v_j)

81
What we have learned

• Probability and Conditional Probability

• Hypothesis, joint / conditional probability, etc.

• Bayes’ Theorem

• Various forms of Bayes’ Theorem, prior, likelihood, Posterior, etc.

• Naïve Bayes

• Multi-attributes, Naïve Bayes assumption, examples.

82
