
Artificial Intelligence

MACHINE LEARNING
BASIC ALGORITHMS

Nguyễn Ngọc Thảo – Nguyễn Hải Minh


{nnthao, nhminh}@fit.hcmus.edu.vn
Outline
• Introduction to Machine learning
• ID3 Decision tree
• Naïve Bayesian classification

2
Acknowledgements
• These slides are mainly based on the textbook AIMA (3rd edition)
• Some parts of the slides are adapted from
• Maria-Florina Balcan, Introduction to Machine Learning, 10-401,
Spring 2018, Carnegie Mellon University
• Ryan Urbanowicz, An Introduction to Machine Learning, PA CURE
Machine Learning Workshop: December 17, School of Medicine,
University of Pennsylvania

3
Machine Learning
What is machine learning?
• Machine learning involves adaptive mechanisms that
enable computers to learn from experience, learn by
example and learn by analogy.

5
Types of machine learning

Source: https://ldi.upenn.edu/sites/default/files/Introduction-to-Machine-Learning.pdf 6
Types of machine learning

Source: https://www.ceralytics.com/3-types-of-machine-learning/ 7
Machine learning algorithms

8
Source: https://ldi.upenn.edu/sites/default/files/Introduction-to-Machine-Learning.pdf
Supervised learning
• Learn a function that maps an input to an output based on
examples, which are pairs of input-output values.

9
Supervised learning: Examples
• Spam detection

Reasonable rules:
• Predict SPAM if unknown AND (money OR pills)
Linearly separable rule:
• Predict SPAM if 2·money + 3·pills − 5·known > 0
10
Supervised learning: Examples
• Object detection

Indoor scene recognition, handwritten digit recognition, scene text recognition

11
Supervised learning: More examples
• Weather prediction: Predict the
weather type or the temperature at
any given location…

• Medicine: diagnose a disease (or response to chemo drug X, or whether a patient is re-admitted soon)
• Input: from symptoms, lab measurements, test results, DNA tests, …
• Output: one of set of possible diseases, or “none of the above”
• E.g., audiology, thyroid cancer, diabetes, etc.

• Computational economics:
• Predict if a user will click on an ad so as to decide which ad to show
• Predict if a stock will rise or fall (with specific amounts)
12
Classification vs. Regression
• Train a model to predict a categorical dependent variable
• Case studies: predicting disease, classifying images,
predicting customer churn, buy or won’t buy, etc.

Binary classification
vs.
Multiclass classification
vs.
Multilabel classification
13
Classification vs. Regression
• Train a model to predict a continuous dependent variable
• Case studies: predicting height of children, predicting sales,
forecasting stock prices, etc.

14
Unsupervised learning
• Infer a function to describe hidden structure from "unlabeled" data
• Class labels (categorizations) are not included in the observations.

15
Unsupervised learning: Examples
• Social network analysis: cluster users of social networks by
interest (community detection)

Ref: Shevtsov, Alexander, et al. "Analysis of Twitter and YouTube during US elections 2020."
arXiv e-prints (2020): arXiv-2010. 16
Semi-supervised learning
• The model is initially trained
with a small amount of labeled
data and a large amount of
unlabeled data.

17
Reinforcement learning
• The agent learns from the environment by interacting with it
and receives rewards for performing actions.

18
Reinforcement learning: Example

19
Reinforcement learning: Examples

https://openai.com/blog/emergent-tool-use/ https://arxiv.org/pdf/1909.07528.pdf
20
Machine learning and related concepts

Source: https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-
deep-learning-ai/
21
Machine learning and related concepts

22
ID3
Decision Tree

23
Learning agents – Why learning?
• Unknown environments
• A robot designed to navigate mazes must learn the layout of each
new maze it encounters.
• Environment changes over time
• An agent designed to predict tomorrow’s stock market prices must
learn to adapt when conditions change from boom to bust.
• No idea how to program a solution
• The task of recognizing the faces of family members

24
Learning element
• Design of a learning element is affected by
• Which components are to be improved
• What prior knowledge the agent already has
• What representation is used for the components
• What feedback is available to learn these components

• Type of feedback
• Supervised learning: correct answers for each example
• Unsupervised learning: correct answers not given
• Reinforcement learning: occasional rewards

25
Supervised learning
• Simplest form: learn a function from examples
• Given a training set of 𝑁 example input-output pairs
(𝑥1, 𝑦1), (𝑥2, 𝑦2), … , (𝑥𝑁 , 𝑦𝑁 )
• where each 𝑦𝑗 was generated by an unknown function 𝑦 = 𝑓(𝑥)
• Find a hypothesis 𝒉 such that 𝒉 ≈ 𝒇
• To measure the accuracy of a hypothesis, give it a test set
of examples that are different from those in the training set.

26
Supervised learning
• Construct ℎ so that it agrees with 𝑓.
• The hypothesis ℎ is consistent if it agrees with 𝑓 on all
observations.
• Ockham’s razor: Select the simplest consistent hypothesis.
(Figure: a consistent linear fit; a consistent 7th-order polynomial fit; an inconsistent linear fit together with a consistent 6th-order polynomial fit; a consistent sinusoidal fit)

27
Supervised learning problems
• h(x) = the predicted output value for the input x

Discrete-valued function vs. continuous-valued function

28
Regression vs. Classification
• Estimating the price
of a house

• Will you pass or fail the exam?


• 2 classes: Fail/Pass

• Is this an apple, an orange or a tomato?


• 3 classes: Apple / Orange / Tomato
29
The wait@restaurant problem

Predicting whether a certain person will wait to have a seat in a restaurant.

30
The wait@restaurant problem
• The decision is based on the following attributes
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

31
The wait@restaurant decision tree
This is our true function.
Can we learn this tree from examples?

32
Learning decision trees
• Divide and conquer: Split data into smaller and smaller subsets
• Splits are usually on a single variable (e.g., test x1 > a at the root, then x2 > b or x2 > g on the yes/no branches)
• After splitting up, each outcome is a new decision tree learning problem with fewer examples and one less attribute.

33
Learning decision trees

Splitting the examples by testing on attributes


34
ID3 Decision tree algorithm
1. If the remaining examples are all positive (or all negative)
→ DONE, it is possible to answer Yes or No.
• E.g., in Figure (b), the None and Some branches
2. If there are some positive and some negative examples →
choose the best attribute to split them
• E.g., in Figure (b), Hungry is used to split the remaining examples

35
ID3 Decision tree algorithm
3. No examples left at a branch → return a default value.
• No example has been observed for a combination of attribute values
• The default value is calculated from the plurality classification of all
the examples that were used in constructing the node’s parent.
• These are passed along in the variable parent examples
4. No attributes left but both positive and negative examples
→ return the plurality classification of remaining ones.
• Examples of the same description, but different classifications
• Usually an error or noise in the data, nondeterministic domain, or no
observation of an attribute that would distinguish the examples.

36
ID3 Decision tree: Pseudo-code

function DECISION-TREE-LEARNING(examples, attributes, parent_examples) returns a tree
  if examples is empty                                    // no examples left
    then return PLURALITY-VALUE(parent_examples)
  else if all examples have the same classification       // remaining examples are all pos / all neg
    then return the classification
  else if attributes is empty                             // no attributes left but examples are still pos & neg
    then return PLURALITY-VALUE(examples)
  else …
37
ID3 Decision tree: Pseudo-code

function DECISION-TREE-LEARNING(examples, attributes, parent_examples) returns a tree
  …
  else
    A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
    tree ← a new decision tree with root test A
    for each value v_k of A do
      exs ← {e : e ∈ examples and e.A = v_k}
      subtree ← DECISION-TREE-LEARNING(exs, attributes − A, examples)
      add a branch to tree with label (A = v_k) and subtree subtree
    return tree

38
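The pseudo-code maps almost directly onto Python. Below is a minimal sketch (an illustration, not part of the original slides): it assumes each example is a dict of attribute values plus a "class" key, and it takes the IMPORTANCE measure as a function argument; an information-gain version of IMPORTANCE is sketched after the entropy slides.

from collections import Counter

def plurality_value(examples):
    # Most common class label among the examples (ties broken arbitrarily)
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, parent_examples, importance):
    if not examples:                                  # no examples left
        return plurality_value(parent_examples)
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                             # all positive or all negative
        return classes.pop()
    if not attributes:                                # no attributes left
        return plurality_value(examples)
    A = max(attributes, key=lambda a: importance(a, examples))
    tree = {A: {}}
    for v in {e[A] for e in examples}:                # one branch per value of A seen in the data
        exs = [e for e in examples if e[A] == v]
        tree[A][v] = decision_tree_learning(
            exs, [a for a in attributes if a != A], examples, importance)
    return tree

One simplification in this sketch: branches are created only for attribute values that actually occur in the examples, whereas the pseudo-code enumerates every value of A (which is what makes case 3, "no examples left", possible).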
Decision tree: Inductive learning
• Simplest: Construct a decision tree with one leaf for every example
→ memory-based learning
→ worse generalization.

• Advanced: Split on each variable so that the purity of each split increases (i.e., either only yes or only no)
• E.g., using entropy to measure the purity of data

39
A purity measure with entropy
• Entropy is a measure of the uncertainty of a random variable V with values v_k. It is an indicator of how messy your data is.
H(V) = Σ_k P(v_k) log2(1 / P(v_k)) = − Σ_k P(v_k) log2 P(v_k)
• v_k is a class in V (e.g., yes/no in binary classification)
• P(v_k) is the proportion of the number of elements in class v_k to the number of elements in V

40
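As a small illustration (my own snippet, not from the slides), the entropy of a list of class labels can be computed directly from this definition:

import math
from collections import Counter

def entropy(labels):
    # H(V) = sum_k P(v_k) * log2(1 / P(v_k)) = -sum_k P(v_k) * log2 P(v_k)
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(entropy(["yes"] * 6 + ["no"] * 6))   # 1.0: maximal for a 50/50 split
print(entropy(["yes"] * 4))                # 0.0: a pure node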
A purity measure with entropy
• Entropy is maximal when
all possibilities are
equally likely.
• Entropy is zero in a pure
”yes” (or pure ”no”) node.

• Decision tree aims to decrease the entropy in each node.


41
The wait@restaurant training data
T = True, F = False
The 12 training examples (table not reproduced here) contain 6 True and 6 False outcomes:
H(S) = −(6/12) log2(6/12) − (6/12) log2(6/12) = 1
42
ID3 Decision tree: An example

Split on Alternate: True → 3 T, 3 F; False → 3 T, 3 F

• Calculate the average entropy of attribute Alternate:
AE_Alternate = P(Alt = T) × H(Alt = T) + P(Alt = F) × H(Alt = F)
AE_Alternate = (6/12)[−(3/6) log2(3/6) − (3/6) log2(3/6)] + (6/12)[−(3/6) log2(3/6) − (3/6) log2(3/6)] = 1

43
ID3 Decision tree: An example

Split on Alternate: True → 3 T, 3 F; False → 3 T, 3 F

• Information Gain is the difference in entropy from before to after the set S is split on the selected attribute:
IG(Alternate, S) = H(S) − AE_Alternate = 1 − 1 = 0

44
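These quantities can be checked in code. The sketch below (again my own illustration) reuses the entropy function from the earlier snippet and reproduces IG(Alternate, S) = 0 for the 3 T, 3 F vs. 3 T, 3 F split:

def average_entropy(split):
    # split: list of (n_true, n_false) pairs, one pair per branch of the test
    total = sum(t + f for t, f in split)
    return sum((t + f) / total * entropy(["T"] * t + ["F"] * f) for t, f in split)

H_S = entropy(["T"] * 6 + ["F"] * 6)        # H(S) = 1.0
ae_alt = average_entropy([(3, 3), (3, 3)])  # AE_Alternate = 1.0
print(H_S - ae_alt)                         # IG(Alternate, S) = 0.0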
ID3 Decision tree: An example
Split on Bar: True → 3 T, 3 F; False → 3 T, 3 F

AE_Bar = (6/12)[−(3/6) log2(3/6) − (3/6) log2(3/6)] + (6/12)[−(3/6) log2(3/6) − (3/6) log2(3/6)] = 1
IG(Bar, S) = H(S) − AE_Bar = 1 − 1 = 0

45
ID3 Decision tree: An example

Split on Sat/Fri?: True → 2 T, 3 F; False → 4 T, 3 F

AE_Sat/Fri = (5/12)[−(2/5) log2(2/5) − (3/5) log2(3/5)] + (7/12)[−(4/7) log2(4/7) − (3/7) log2(3/7)] = 0.979
IG(Sat/Fri?, S) = H(S) − AE_Sat/Fri = 1 − 0.979 = 0.021

46
ID3 Decision tree: An example

Split on Hungry: True → 5 T, 2 F; False → 1 T, 4 F

AE_Hungry = (7/12)[−(5/7) log2(5/7) − (2/7) log2(2/7)] + (5/12)[−(1/5) log2(1/5) − (4/5) log2(4/5)] = 0.804
IG(Hungry, S) = H(S) − AE_Hungry = 1 − 0.804 = 0.196

47
ID3 Decision tree: An example

Split on Raining: True → 2 T, 2 F; False → 4 T, 4 F

AE_Raining = (4/12)[−(2/4) log2(2/4) − (2/4) log2(2/4)] + (8/12)[−(4/8) log2(4/8) − (4/8) log2(4/8)] = 1
IG(Raining, S) = H(S) − AE_Raining = 1 − 1 = 0

48
ID3 Decision tree: An example
Split on Reservation: True → 3 T, 2 F; False → 3 T, 4 F

AE_Reservation = (5/12)[−(3/5) log2(3/5) − (2/5) log2(2/5)] + (7/12)[−(3/7) log2(3/7) − (4/7) log2(4/7)] = 0.979
IG(Reservation, S) = H(S) − AE_Reservation = 1 − 0.979 = 0.021

49
ID3 Decision tree: An example

Split on Patrons: None → 2 F; Some → 4 T; Full → 2 T, 4 F

AE_Patrons = (2/12)[−(0/2) log2(0/2) − (2/2) log2(2/2)] + (4/12)[−(4/4) log2(4/4) − (0/4) log2(0/4)] + (6/12)[−(2/6) log2(2/6) − (4/6) log2(4/6)] = 0.459 (taking 0 log2 0 = 0)
IG(Patrons, S) = H(S) − AE_Patrons = 1 − 0.459 = 0.541
50
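Reusing the helper functions from the earlier snippets, these figures can be reproduced (illustrative check only):

ae_patrons = average_entropy([(0, 2), (4, 0), (2, 4)])  # branches: None, Some, Full
print(round(ae_patrons, 3))                             # 0.459
print(round(1 - ae_patrons, 3))                         # IG(Patrons, S) = 0.541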
ID3 Decision tree: An example

Split on Price: $ → 3 T, 3 F; $$ → 2 T; $$$ → 1 T, 3 F

AE_Price = (6/12)[−(3/6) log2(3/6) − (3/6) log2(3/6)] + (2/12)[−(2/2) log2(2/2) − (0/2) log2(0/2)] + (4/12)[−(1/4) log2(1/4) − (3/4) log2(3/4)] = 0.770
IG(Price, S) = H(S) − AE_Price = 1 − 0.770 = 0.23
51
ID3 Decision tree: An example

Split on Type: French → 1 T, 1 F; Italian → 1 T, 1 F; Thai → 2 T, 2 F; Burger → 2 T, 2 F

AE_Type = (2/12)[−(1/2) log2(1/2) − (1/2) log2(1/2)] + (2/12)[−(1/2) log2(1/2) − (1/2) log2(1/2)] + (4/12)[−(2/4) log2(2/4) − (2/4) log2(2/4)] + (4/12)[−(2/4) log2(2/4) − (2/4) log2(2/4)] = 1
IG(Type, S) = H(S) − AE_Type = 1 − 1 = 0
52
ID3 Decision tree: An example
Split on WaitEstimate (est. waiting time): 0-10 → 4 T, 2 F; 10-30 → 1 T, 1 F; 30-60 → 1 T, 1 F; >60 → 2 F

AE_WaitEstimate = (6/12)[−(4/6) log2(4/6) − (2/6) log2(2/6)] + (2/12)[−(1/2) log2(1/2) − (1/2) log2(1/2)] + (2/12)[−(1/2) log2(1/2) − (1/2) log2(1/2)] + (2/12)[−(0/2) log2(0/2) − (2/2) log2(2/2)] = 0.792
IG(WaitEstimate, S) = H(S) − AE_WaitEstimate = 1 − 0.792 = 0.208
53
ID3 Decision tree: An example
• Largest Information Gain (0.541), i.e., smallest average entropy (0.459), is achieved by splitting on Patrons.
Patrons? None → 2 F; Some → 4 T; Full → 2 T, 4 F (split Full further on some attribute X?)

• Continue making new splits, always purifying nodes

54
ID3 Decision tree algorithm

True tree vs. induced tree (from examples): the learner cannot make the tree more complex than what the data supports.


55
Performance measurement
• How do we know that ℎ ≈ 𝑓?
1. Use theorems of computational or statistical learning theory
2. Try ℎ on a new test set of examples
• Use the same distribution over example space as training set

Learning curve = % correct on the test set as a function of training set size

56
Quiz 01: ID3 decision tree
• The data represent files on a computer system. Possible
values of the class variable are “infected”, which implies the
file has a virus infection, or “clean” if it doesn't.
• Derive a decision tree for virus identification.
No. Writable Updated Size Class
1 Yes No Small Infected
2 Yes Yes Large Infected
3 No Yes Med Infected
4 No No Med Clean
5 Yes No Large Clean
6 No No Large Clean

57
Naïve Bayesian
classification

58
Bayesian classification
• A statistical classifier performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem

59
Bayesian classification
• Performance
• A simple Bayesian classifier (e.g., naïve Bayesian) has performance comparable to decision trees and selected neural network classifiers.
• Incremental
• Each training example can incrementally increase/decrease the
probability that a hypothesis is correct
• That is, prior knowledge can be combined with observed data.
• Standard
• Even when Bayesian methods are computationally intractable, they
can provide a standard of optimal decision making against which
other methods can be measured

60
The buying computer dataset
age income student credit_rating buys_computer

<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
61
Bayes’ Theorem
• Total Probability Theorem: P(B) = Σ_{i=1..M} P(B|A_i) P(A_i)

• Let X be a data sample ("evidence") with unknown class label and H be a hypothesis that X belongs to class C
• Bayes' Theorem: P(H|X) = P(X|H) P(H) / P(X)

• Classification is to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X.

62
Bayes’ Theorem
• P(H) (prior probability): the initial probability
  • E.g., X will buy a computer, regardless of age, income, …
• P(X) (evidence): the probability that the sample data is observed
  • E.g., X is 31..40 and has a medium income, regardless of the buying
• P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  • E.g., given that X will buy a computer, the probability that X is 31..40 and has a medium income
• P(H|X) = P(X|H) P(H) / P(X) (posterior probability)
  • E.g., given that X is 31..40 and has a medium income, the probability that X will buy a computer
63
Bayes’ Theorem
• Informally, P(H|X) = P(X|H) P(H) / P(X) can be viewed as
  posterior = likelihood × prior / evidence
• X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all the k classes

• Practical difficulty
  • Requires initial knowledge of many probabilities
• Significant computational cost involved

64
Classification with Bayes’ Theorem

• Let D be a training set of tuples and their associated class labels
• Each tuple is represented by an n-attribute vector X = (x_1, x_2, …, x_n)
• Suppose there are m classes C_1, C_2, …, C_m
• Classification is to derive the maximum a posteriori P(C_i|X) from Bayes' theorem
  P(C_i|X) = P(X|C_i) P(C_i) / P(X)
• P(X) is constant for all classes, so only P(X|C_i) P(C_i) needs to be maximized.

65
Naïve Bayesian classification
• Class-conditional independence: There are no dependence relationships among the attributes
• The naïve Bayesian classification formula is written as
  P(X|C_i) = ∏_{k=1..n} P(x_k|C_i) = P(x_1|C_i) × P(x_2|C_i) × ⋯ × P(x_n|C_i)
  • If A_k is categorical: P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k divided by |C_i,D| (the number of tuples of C_i in D)
  • If A_k is continuous: P(x_k|C_i) = g(x_k, μ_{C_i}, σ_{C_i}) with the Gaussian distribution g(x, μ, σ) = (1 / (σ√(2π))) e^{−(x−μ)² / (2σ²)}

• Count class distributions only → computation cost reduced
66
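For categorical attributes, this classifier is only a few lines of code. The sketch below is my own illustration (the function names and data layout are assumptions, not from the slides); continuous attributes would plug the Gaussian density in place of the frequency ratio.

from collections import Counter, defaultdict

def train_naive_bayes(rows, class_attr):
    # rows: list of dicts mapping attribute name -> categorical value (class attribute included)
    priors = Counter(r[class_attr] for r in rows)      # class -> |C_i,D|
    counts = defaultdict(Counter)                      # (attribute, class) -> Counter of values
    for r in rows:
        c = r[class_attr]
        for a, v in r.items():
            if a != class_attr:
                counts[(a, c)][v] += 1
    return priors, counts, len(rows)

def classify(x, priors, counts, n_rows):
    # x: dict attribute -> value for the unseen tuple (no class attribute)
    scores = {}
    for c, n_c in priors.items():
        p = n_c / n_rows                               # P(C_i)
        for a, v in x.items():
            p *= counts[(a, c)][v] / n_c               # P(x_k | C_i) as a relative frequency, no smoothing
        scores[c] = p
    return max(scores, key=scores.get), scores

Note that an unseen attribute value gives a zero count and therefore a zero product; this is exactly the zero-probability issue addressed by the Laplacian correction a few slides later.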
Naïve Bayesian classification: An example
P(buys_computer = “yes”) 9/14
P(buys_computer = “no”) 5/14

buys_computer = “yes” buys_computer = “no”


age = “<=30” 2/9 3/5
age = “31…40” 4/9 0/5
age = “>40” 3/9 2/5
income = “low” 3/9 1/5
income = “medium” 4/9 2/5
income = “high” 2/9 2/5
student = “yes” 6/9 1/5
student = “no” 3/9 4/5
credit_rating = “fair” 6/9 2/5
credit_rating = “excellent” 3/9 3/5

67
Naïve Bayesian classification: An example
age income student credit_rating buys_computer
<=30 medium yes fair ?
• P(X|C_i)
  • P(X | buys_computer = "yes") = 2/9 × 4/9 × 6/9 × 6/9 = 0.044
  • P(X | buys_computer = "no") = 3/5 × 2/5 × 1/5 × 2/5 = 0.019
• P(X|C_i) × P(C_i)
  • P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  • P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
• P(C_i|X)
  • P(buys_computer = "yes" | X) = 0.8
  • P(buys_computer = "no" | X) = 0.2
Therefore, X belongs to class ("buys_computer = yes")
68
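The arithmetic above can be verified with a few lines (illustrative only):

p_x_yes = 2/9 * 4/9 * 6/9 * 6/9            # ~0.044
p_x_no  = 3/5 * 2/5 * 1/5 * 2/5            # ~0.019
joint_yes = p_x_yes * 9/14                 # ~0.028
joint_no  = p_x_no * 5/14                  # ~0.007
print(joint_yes / (joint_yes + joint_no))  # ~0.80 -> predict buys_computer = "yes"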
Avoiding the zero-probability issue
• The naïve Bayesian prediction requires each conditional probability to be non-zero:
  P(X|C_i) = ∏_{k=1..n} P(x_k|C_i)
• Otherwise, the predicted probability will be zero
• For example,
  age income student credit_rating buys_computer
  31…40 medium yes fair ?
  • P(X | buys_computer = "no") = 0 × 2/5 × 1/5 × 2/5 = 0
  • Therefore, the conclusion is always "yes" regardless of the value of P(X | buys_computer = "yes")
69
Avoiding the zero-probability issue
• Laplacian correction (or Laplacian estimator)
  P(C_i) = (|C_i| + 1) / (|D| + m)        P(x_k|C_i) = (|x_k ∪ C_i| + 1) / (|C_i| + r)
  • where m is the number of classes, |x_k ∪ C_i| denotes the number of tuples containing both A_k = x_k and C_i, and r is the number of values of attribute A_k
• The "corrected" probability estimates are close to their "uncorrected" counterparts

70
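As a small illustration of the corrected estimates (the helper names are my own, not from the slides):

def laplace_prior(n_ci, n_total, m):
    # P(C_i) = (|C_i| + 1) / (|D| + m), with m the number of classes
    return (n_ci + 1) / (n_total + m)

def laplace_likelihood(n_xk_ci, n_ci, r):
    # P(x_k | C_i) = (|x_k, C_i| + 1) / (|C_i| + r), with r the number of values of attribute A_k
    return (n_xk_ci + 1) / (n_ci + r)

print(laplace_prior(9, 14, 2))       # 10/16, as in the corrected table on the next slide
print(laplace_likelihood(0, 5, 3))   # 1/8 instead of 0/5 for P(age = "31...40" | buys_computer = "no")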
Naïve Bayesian classification: An example
P(buys_computer = “yes”) 10/16
P(buys_computer = “no”) 6/16

buys_computer = “yes” buys_computer = “no”


age = “<=30” 3/12 4/8
age = “31…40” 5/12 1/8
age = “>40” 4/12 3/8
income = “low” 4/12 2/8
income = “medium” 5/12 3/8
income = “high” 3/12 3/8
student = “yes” 7/11 2/7
student = “no” 4/11 5/7
credit_rating = “fair” 7/11 3/7
credit_rating = “excellent” 4/11 4/7

71
Naïve Bayesian classification: An example
age income student credit_rating buys_computer
31..40 medium yes fair ?
• P(X|C_i)
  • P(X | buys_computer = "yes") = 5/12 × 5/12 × 7/11 × 7/11 = 0.070
  • P(X | buys_computer = "no") = 1/8 × 3/8 × 2/7 × 3/7 = 0.006
• P(X|C_i) × P(C_i)
  • P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.044
  • P(X | buys_computer = "no") × P(buys_computer = "no") = 0.002
• P(C_i|X)
  • P(buys_computer = "yes" | X) = 0.953
  • P(buys_computer = "no" | X) = 0.047
Therefore, X belongs to class ("buys_computer = yes")
72
Handling missing values
• If the values of some attributes are missing, these attributes
are omitted from the product of probabilities
• As a result, the estimation is less accurate
• For example,
age income student credit_rating buys_computer
? medium yes fair ?

73
Naïve Bayesian classification: Evaluation

• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Class conditional independence → loss of accuracy
• Practically, dependencies exist among variables, which cannot be
modeled by Naïve Bayes
• E.g., in medical records, patients’ profile (age, family history, etc.),
symptoms (fever, cough etc.), disease (lung cancer, diabetes, etc.)

• How to deal with these dependencies?
  • Bayesian Belief Networks

74
Quiz 02: Naïve Bayesian classification

• The data represent files on a computer system. Possible values of the class variable are "infected", which implies the file has a virus infection, or "clean" if it doesn't.
• Derive naïve Bayesian probabilities for virus identification in either case, with or without Laplacian correction.
No. Writable Updated Size Class
1 Yes No Small Infected
2 Yes Yes Large Infected
3 No Yes Med Infected
4 No No Med Clean
5 Yes No Large Clean
6 No No Large Clean
75
THE END

76
