2021 Lecture10 BasicML
MACHINE LEARNING
BASIC ALGORITHMS
2
Acknowledgements
• These slides are mainly based on the textbook AIMA (3rd edition)
• Some parts of the slides are adapted from
• Maria-Florina Balcan, Introduction to Machine Learning, 10-401,
Spring 2018, Carnegie Mellon University
• Ryan Urbanowicz, An Introduction to Machine Learning, PA CURE
Machine Learning Workshop: December 17, School of Medicine,
University of Pennsylvania
3
Machine Learning
What is machine learning?
• Machine learning involves adaptive mechanisms that
enable computers to learn from experience, learn by
example and learn by analogy.
5
Types of machine learning
Source: https://ldi.upenn.edu/sites/default/files/Introduction-to-Machine-Learning.pdf
Types of machine learning
Source: https://www.ceralytics.com/3-types-of-machine-learning/
Machine learning algorithms
8
Source: https://ldi.upenn.edu/sites/default/files/Introduction-to-Machine-Learning.pdf
Supervised learning
• Learn a function that maps an input to an output based on
examples, which are pairs of input-output values.
9
Supervised learning: Examples
• Spam detection
Reasonable RULES
• Predict SPAM if unknown AND (money OR pills)
Linearly separable
• Predict SPAM if 2·money + 3·pills - 5·known > 0 (a minimal sketch follows)
10
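To make the linear rule concrete, here is a minimal sketch (mine, not from the slides) that evaluates it on binary indicator features; the feature names mirror the rule above.

```python
# Minimal sketch (not from the slides): evaluate the linear spam rule
# 2*money + 3*pills - 5*known > 0 on binary indicator features.
def predict_spam(money: int, pills: int, known: int) -> bool:
    """Return True (predict SPAM) when the weighted score is positive."""
    return 2 * money + 3 * pills - 5 * known > 0

print(predict_spam(money=1, pills=0, known=0))  # True: mentions money, sender unknown
print(predict_spam(money=1, pills=1, known=1))  # False: a known sender outweighs the rest
```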
Supervised learning: Examples
• Object detection
Scene text
recognition
11
Supervised learning: More examples
• Weather prediction: Predict the
weather type or the temperature at
any given location…
• Computational economics:
• Predict if a user will click on an ad so as to decide which ad to show
• Predict whether a stock will rise or fall (and by how much)
12
Classification vs. Regression
• Train a model to predict a categorical dependent variable
• Case studies: predicting disease, classifying images,
predicting customer churn, buy or won’t buy, etc.
Binary classification
vs.
Multiclass classification
vs.
Multilabel classification
13
Classification vs. Regression
• Train a model to predict a continuous dependent variable
• Case studies: predicting height of children, predicting sales,
forecasting stock prices, etc.
14
Unsupervised learning
• Infer a function to describe hidden structure from "unlabeled"
data
• A classification (or categorization) is not included in the observations.
15
Unsupervised learning: Examples
• Social network analysis: cluster users of social networks by
interest (community detection)
Ref: Shevtsov, Alexander, et al. "Analysis of Twitter and YouTube during US elections 2020." arXiv e-prints (2020): arXiv-2010.
Semi-supervised learning
• The model is initially trained
with a small amount of labeled
data and a large amount of
unlabeled data.
17
Reinforcement learning
• The agent learns from the environment by interacting with it
and receives rewards for performing actions.
18
Reinforcement learning: Example
19
Reinforcement learning: Examples
https://openai.com/blog/emergent-tool-use/ https://arxiv.org/pdf/1909.07528.pdf
20
Machine learning and related concepts
Source: https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
21
Machine learning and related concepts
22
ID3
Decision Tree
23
Learning agents – Why learning?
• Unknown environments
• A robot designed to navigate mazes must learn the layout of each
new maze it encounters.
• Environment changes over time
• An agent designed to predict tomorrow’s stock market prices must
learn to adapt when conditions change from boom to bust.
• No idea how to program a solution
• The task of recognizing the faces of family members
24
Learning element
• Design of a learning element is affected by
• Which components are to be improved
• What prior knowledge the agent already has
• What representation is used for the components
• What feedback is available to learn these components
• Type of feedback
• Supervised learning: correct answers for each example
• Unsupervised learning: correct answers not given
• Reinforcement learning: occasional rewards
25
Supervised learning
• Simplest form: learn a function from examples
• Given a training set of 𝑁 example input-output pairs
(𝑥1, 𝑦1), (𝑥2, 𝑦2), … , (𝑥𝑁 , 𝑦𝑁 )
• where each 𝑦𝑗 was generated by an unknown function 𝑦 = 𝑓(𝑥)
• Find a hypothesis 𝒉 such that 𝒉 ≈ 𝒇
• To measure the accuracy of a hypothesis, give it a test set of examples that are distinct from those in the training set (see the sketch below).
26
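As a concrete, hypothetical illustration of this setup: generate (x, y) pairs from an "unknown" f, fit a hypothesis h on the training examples, and score it on a held-out test set. The sine target and the degree-3 polynomial hypothesis space are my own choices, not the slides'.

```python
# Minimal sketch: learn a hypothesis h from example (x, y) pairs produced by
# an unknown function f, then measure accuracy on a held-out test set.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = np.sin(x) + rng.normal(scale=0.1, size=100)   # the "unknown" f, plus noise

train_x, train_y = x[:80], y[:80]
test_x, test_y = x[80:], y[80:]                   # examples never seen in training

# Hypothesis space: degree-3 polynomials; h is the least-squares fit.
h = np.poly1d(np.polyfit(train_x, train_y, deg=3))

test_mse = np.mean((h(test_x) - test_y) ** 2)     # how well h approximates f
print(f"test MSE: {test_mse:.4f}")
```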
Supervised learning
• Construct ℎ so that it agrees with 𝑓.
• The hypothesis ℎ is consistent if it agrees with 𝑓 on all
observations.
• Ockham’s razor: Select the simplest consistent hypothesis.
(Figure: candidate fits to the same data: a consistent linear fit; a consistent 7th-order polynomial fit; an inconsistent linear fit and a consistent 6th-order polynomial fit; a consistent sinusoidal fit.)
27
Supervised learning problems
• h(x) = the predicted output value for the input x
28
Regression vs. Classification
• Estimating the price
of a house
30
The wait@restaurant problem
• The decision is based on the following attributes
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
31
The wait@restaurant decision tree
This is our true function.
Can we learn this tree from examples?
32
Learning decision trees
• Divide and conquer: Split data into smaller and smaller subsets
• Splits are usually on a single variable
(Figure: an example tree: first test x1 > a?; on the yes-branch test x2 > b?, on the no-branch test x2 > g?)
33
Learning decision trees
35
ID3 Decision tree algorithm
3. No examples left at a branch → return a default value.
• No example has been observed for a combination of attribute values
• The default value is calculated from the plurality classification of all
the examples that were used in constructing the node’s parent.
• These are passed along in the variable parent_examples
4. No attributes left but both positive and negative examples
→ return the plurality classification of the remaining examples.
• Examples with the same description but different classifications
• Usually caused by an error or noise in the data, a nondeterministic domain, or the lack of an observed attribute that would distinguish the examples.
36
ID3 Decision tree: Pseudo-code
37
ID3 Decision tree: Pseudo-code
38
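As a minimal stand-in for the pseudo-code, here is a Python sketch (my own, with hypothetical names) of the decision-tree-learning recursion described on the previous slide, for categorical attributes; the attribute-selection criterion is passed in as importance (e.g., information gain).

```python
# Minimal sketch of the ID3 / decision-tree-learning recursion outlined above.
# Examples are (features_dict, label) pairs with categorical attributes.
from collections import Counter

def plurality_value(examples):
    """Most common class label among the given examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def id3(examples, attributes, parent_examples, importance):
    if not examples:                       # case 3: no examples left
        return plurality_value(parent_examples)
    labels = {label for _, label in examples}
    if len(labels) == 1:                   # all examples agree: make a leaf
        return labels.pop()
    if not attributes:                     # case 4: no attributes left
        return plurality_value(examples)
    # Split on the attribute the importance measure ranks highest
    # (e.g., information gain).
    best = max(attributes, key=lambda a: importance(a, examples))
    tree = {best: {}}
    for value in {features[best] for features, _ in examples}:
        subset = [(f, l) for f, l in examples if f[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, examples, importance)
    return tree
```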
Decision tree: Inductive learning
• Simplest: Construct a decision tree
with one leaf for every example
→ memory based learning
→ worse generalization.
39
A purity measure with entropy
• Entropy is a measure of the uncertainty of a random variable V with values v_k; it indicates how "messy" the data is
H(V) = Σ_k P(v_k) log2(1 / P(v_k)) = -Σ_k P(v_k) log2 P(v_k)
• v_k is a class in V (e.g., yes/no in binary classification)
• P(v_k) is the proportion of the number of elements in class v_k to the number of elements in V
40
A purity measure with entropy
• Entropy is maximal when
all possibilities are
equally likely.
• Entropy is zero in a pure
”yes” (or pure ”no”) node.
• For the full training set S (6 True, 6 False):
H(S) = -(6/12) log2(6/12) - (6/12) log2(6/12) = 1
42
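A minimal helper (my own sketch) that reproduces the entropy calculation above from class counts:

```python
# Minimal sketch: entropy of a node from its class counts.
from math import log2

def entropy(counts):
    """H = -sum_k p_k * log2(p_k), skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([6, 6]))            # 1.0: 6 True / 6 False, maximal uncertainty
print(round(entropy([2, 4]), 3))  # 0.918: a 2/4 split is less uncertain
```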
ID3 Decision tree: An example
Alternate?
True branch: 3 T, 3 F; False branch: 3 T, 3 F
AE_Alternate = (6/12)[-(3/6) log2(3/6) - (3/6) log2(3/6)] + (6/12)[-(3/6) log2(3/6) - (3/6) log2(3/6)] = 1
43
ID3 Decision tree: An example
Alternate?
True branch: 3 T, 3 F; False branch: 3 T, 3 F
IG(Alternate, S) = H(S) - AE_Alternate = 1 - 1 = 0
44
ID3 Decision tree: An example
Bar?
True branch: 3 T, 3 F; False branch: 3 T, 3 F
AE_Bar = (6/12)[-(3/6) log2(3/6) - (3/6) log2(3/6)] + (6/12)[-(3/6) log2(3/6) - (3/6) log2(3/6)] = 1
IG(Bar, S) = H(S) - AE_Bar = 1 - 1 = 0
45
ID3 Decision tree: An example
Fri/Sat?
True branch: 2 T, 3 F; False branch: 4 T, 3 F
AE_Fri/Sat = (5/12)[-(2/5) log2(2/5) - (3/5) log2(3/5)] + (7/12)[-(4/7) log2(4/7) - (3/7) log2(3/7)] = 0.979
46
ID3 Decision tree: An example
Hungry?
True branch: 5 T, 2 F; False branch: 1 T, 4 F
AE_Hungry = (7/12)[-(5/7) log2(5/7) - (2/7) log2(2/7)] + (5/12)[-(1/5) log2(1/5) - (4/5) log2(4/5)] = 0.804
47
ID3 Decision tree: An example
Raining?
True branch: 2 T, 2 F; False branch: 4 T, 4 F
AE_Raining = (4/12)[-(2/4) log2(2/4) - (2/4) log2(2/4)] + (8/12)[-(4/8) log2(4/8) - (4/8) log2(4/8)] = 1
IG(Raining, S) = H(S) - AE_Raining = 1 - 1 = 0
48
ID3 Decision tree: An example
Reservation?
True branch: 3 T, 2 F; False branch: 3 T, 4 F
AE_Reservation = (5/12)[-(3/5) log2(3/5) - (2/5) log2(2/5)] + (7/12)[-(3/7) log2(3/7) - (4/7) log2(4/7)] = 0.979
49
ID3 Decision tree: An example
Patrons?
None branch: 0 T, 2 F; Some branch: 4 T, 0 F; Full branch: 2 T, 4 F
AE_Patrons = (2/12)[-(0/2) log2(0/2) - (2/2) log2(2/2)] + (4/12)[-(4/4) log2(4/4) - (0/4) log2(0/4)] + (6/12)[-(2/6) log2(2/6) - (4/6) log2(4/6)] = 0.459 (taking 0 log2 0 = 0)
IG(Patrons, S) = H(S) - AE_Patrons = 1 - 0.459 = 0.541
50
ID3 Decision tree: An example
Price
$ branch: 3 T, 3 F; $$ branch: 2 T, 0 F; $$$ branch: 1 T, 3 F
AE_Price = (6/12)[-(3/6) log2(3/6) - (3/6) log2(3/6)] + (2/12)[-(2/2) log2(2/2) - (0/2) log2(0/2)] + (4/12)[-(1/4) log2(1/4) - (3/4) log2(3/4)] = 0.770
IG(Price, S) = H(S) - AE_Price = 1 - 0.770 = 0.230
51
ID3 Decision tree: An example
Type
French branch: 1 T, 1 F; Italian branch: 1 T, 1 F; Thai branch: 2 T, 2 F; Burger branch: 2 T, 2 F
AE_Type = (2/12)[-(1/2) log2(1/2) - (1/2) log2(1/2)] + (2/12)[-(1/2) log2(1/2) - (1/2) log2(1/2)] + (4/12)[-(2/4) log2(2/4) - (2/4) log2(2/4)] + (4/12)[-(2/4) log2(2/4) - (2/4) log2(2/4)] = 1
IG(Type, S) = H(S) - AE_Type = 1 - 1 = 0
52
ID3 Decision tree: An example
Est. waiting time
0-10 branch: 4 T, 2 F; 10-30 branch: 1 T, 1 F; 30-60 branch: 1 T, 1 F; >60 branch: 0 T, 2 F
AE_WaitEstimate = (6/12)[-(4/6) log2(4/6) - (2/6) log2(2/6)] + (2/12)[-(1/2) log2(1/2) - (1/2) log2(1/2)] + (2/12)[-(1/2) log2(1/2) - (1/2) log2(1/2)] + (2/12)[-(0/2) log2(0/2) - (2/2) log2(2/2)] = 0.792
IG(WaitEstimate, S) = H(S) - AE_WaitEstimate = 1 - 0.792 = 0.208
53
ID3 Decision tree: An example
• Largest information gain (0.541) / smallest average entropy (0.459) is achieved by splitting on Patrons.
Patrons?
None branch: 2 F; Some branch: 4 T; Full branch: 2 T, 4 F, to be split further (X?)
54
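The per-attribute calculations above can be reproduced with a short helper (my own sketch; the entropy function is repeated so the snippet runs on its own). Each branch is given as a (num_True, num_False) pair.

```python
# Minimal sketch: average entropy (AE) and information gain (IG) of a split,
# matching the hand calculations above. Branches are (num_True, num_False).
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(branches, parent=(6, 6)):
    total = sum(t + f for t, f in branches)
    ae = sum(((t + f) / total) * entropy([t, f]) for t, f in branches)
    return entropy(list(parent)) - ae

# Patrons: None = (0 T, 2 F), Some = (4 T, 0 F), Full = (2 T, 4 F)
print(round(information_gain([(0, 2), (4, 0), (2, 4)]), 3))          # 0.541
# Type: every branch has equal True/False counts, so the split tells us nothing
print(round(information_gain([(1, 1), (1, 1), (2, 2), (2, 2)]), 3))  # 0.0
```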
ID3 Decision tree algorithm
(Figure: the learned tree vs. the true tree.)
• The learned tree cannot be made more complex than what the data supports.
56
Quiz 01: ID3 decision tree
• The data represent files on a computer system. Possible
values of the class variable are “infected”, which implies the
file has a virus infection, or “clean” if it doesn't.
• Derive a decision tree for virus identification.
No. Writable Updated Size Class
1 Yes No Small Infected
2 Yes Yes Large Infected
3 No Yes Med Infected
4 No No Med Clean
5 Yes No Large Clean
6 No No Large Clean
57
Naïve Bayesian
classification
58
Bayesian classification
• A statistical classifier performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem
59
Bayesian classification
• Performance
• A simple Bayesian classifier (e.g., naïve Bayes) has performance comparable to decision tree and selected neural network classifiers.
• Incremental
• Each training example can incrementally increase/decrease the
probability that a hypothesis is correct
• That is, prior knowledge can be combined with observed data.
• Standard
• Even when Bayesian methods are computationally intractable, they
can provide a standard of optimal decision making against which
other methods can be measured
60
The buying computer dataset
age income student credit_rating buys_computer
62
Bayes’ Theorem
• 𝑷(𝑯) (prior probability): the initial probability
• E.g., 𝐗 will buy computer, regardless of age, income, …
• 𝑃(𝐗) : the probability that sample data is observed
• E.g., 𝐗 is 31..40 and has a medium income, regardless of the buying
• P(X | H) (likelihood): the probability of observing the sample X, given that the hypothesis H holds
• E.g., given that 𝐗 will buy computer, the probability that 𝐗 is 31..40
and has a medium income
• P(H | X) = P(X | H) P(H) / P(X) (posterior probability)
• E.g., given that 𝐗 is 31..40 and has a medium income, the probability
that 𝐗 will buy computer
63
Bayes’ Theorem
• Informally, P(H | X) = P(X | H) P(H) / P(X) can be viewed as
posterior = likelihood × prior / evidence
• X belongs to class Ci iff the posterior P(Ci | X) is the highest among all the P(Ck | X) over the k classes (see the sketch below)
• Practical difficulty
• Require initial knowledge of many probabilities
• Significant computational cost involved
64
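A minimal sketch (my own; the numbers are placeholders) of the decision rule just described: since P(X) is the same for every class, comparing P(X | Ci) · P(Ci) is enough.

```python
# Minimal sketch: choose the class Ci with the highest posterior P(Ci | X).
# Comparing P(X | Ci) * P(Ci) suffices, because P(X) is common to all classes.
def map_class(scores):
    """scores maps each class to P(X | Ci) * P(Ci); return the argmax class."""
    return max(scores, key=scores.get)

print(map_class({"C1": 0.03, "C2": 0.01}))  # "C1" (placeholder numbers)
```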
Classification with Bayes’ Theorem
65
Naïve Bayesian classification
• Class-conditional independence: There are no dependence
relationships among the attributes
• The naïve Bayesian classification formula is written as
P(X | Ci) = ∏_{k=1..n} P(x_k | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)
• A_k is categorical: P(x_k | Ci) is the number of tuples in Ci having value x_k for A_k, divided by |C_i,D| (the number of tuples of Ci in D)
• A_k is continuous-valued: P(x_k | Ci) = g(x_k, μ_Ci, σ_Ci) with the Gaussian distribution g(x, μ, σ) = (1 / (σ√(2π))) e^(-(x-μ)² / (2σ²))
67
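A minimal sketch (my own names) of the two per-attribute likelihood estimates listed above: a relative frequency for categorical attributes and the Gaussian density g(x, μ, σ) for continuous ones.

```python
# Minimal sketch: the per-attribute likelihoods P(x_k | Ci) used by naive Bayes.
from math import exp, pi, sqrt

def categorical_likelihood(count_xk_in_ci, count_ci):
    """Categorical A_k: tuples of Ci with value x_k, divided by |Ci,D|."""
    return count_xk_in_ci / count_ci

def gaussian_likelihood(x, mu, sigma):
    """Continuous A_k: Gaussian density g(x, mu, sigma) fitted on class Ci."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(categorical_likelihood(6, 9))          # e.g. 6 of 9 class tuples match
print(gaussian_likelihood(38.0, 35.0, 5.0))  # density of x = 38 under N(35, 5^2)
```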
Naïve Bayesian classification: An example
age income student credit_rating buys_computer
<=30 medium yes fair ?
• P(X | Ci)
• 𝑃(𝐗 | 𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑦𝑒𝑠”) = 2/9 ∗ 4/9 ∗ 6/9 ∗ 6/9 = 0.044
• 𝑃(𝐗 | 𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑛𝑜”) = 3/5 ∗ 2/5 ∗ 1/5 ∗ 2/5 = 0.019
• P(X | Ci) × P(Ci)
• 𝑃(𝐗 | 𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑦𝑒𝑠”) ∗ 𝑃(𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑦𝑒𝑠”) = 0.028
• 𝑃(𝐗| 𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑛𝑜”) ∗ 𝑃(𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑛𝑜”) = 0.007
• P(Ci | X)
• 𝑃(𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑦𝑒𝑠” | 𝐗) = 0.8
• 𝑃(𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑛𝑜” | 𝐗) = 0.2
Therefore, 𝐗 belongs to class (“buys_computer = yes”)
68
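The numbers above can be reproduced directly. A minimal sketch, under the assumption (implied by the 0.028 and 0.007 products shown) that the training data contain 9 "yes" and 5 "no" tuples:

```python
# Minimal sketch reproducing the worked example: multiply the per-attribute
# conditionals from the slide, weight by the class priors, then normalise.
from math import prod

likelihood_yes = prod([2/9, 4/9, 6/9, 6/9])   # P(X | yes) ~ 0.044
likelihood_no = prod([3/5, 2/5, 1/5, 2/5])    # P(X | no)  ~ 0.019

score_yes = likelihood_yes * 9/14             # P(X | yes) * P(yes) ~ 0.028
score_no = likelihood_no * 5/14               # P(X | no)  * P(no)  ~ 0.007

total = score_yes + score_no                  # normalising constant P(X)
print(round(score_yes / total, 2), round(score_no / total, 2))  # 0.8 0.2
```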
Avoiding the zero-probability issue
• The naïve Bayesian prediction requires each conditional
probability be non-zero.
P(X | Ci) = ∏_{k=1..n} P(x_k | Ci)
• Otherwise, the predicted probability will be zero
• For example,
age income student credit_rating buys_computer
31…40 medium yes fair ?
70
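One common remedy, assumed here rather than taken from the slides, is the Laplacian (add-one) correction: add one pseudo-count per possible attribute value so that no conditional probability is exactly zero. A minimal sketch with illustrative numbers:

```python
# Minimal sketch (assumed fix): Laplacian / add-one correction for P(x_k | Ci).
def smoothed_conditional(count_xk_in_ci, count_ci, num_attribute_values):
    """Add one pseudo-count per possible attribute value."""
    return (count_xk_in_ci + 1) / (count_ci + num_attribute_values)

# Illustrative numbers only: a value never seen in a class of 6 tuples, for an
# attribute with 3 possible values, gets 1/9 instead of 0.
print(smoothed_conditional(0, 6, 3))  # ~0.111
```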
Naïve Bayesian classification: An example
P(buys_computer = “yes”) = 10/16
P(buys_computer = “no”) = 6/16
71
Naïve Bayesian classification: An example
age income student credit_rating buys_computer
31..40 medium yes fair ?
• P(X | Ci)
• 𝑃(𝐗 | 𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑦𝑒𝑠”) = 5/12 ∗ 5/12 ∗ 7/11 ∗ 7/11 = 0.070
• 𝑃(𝐗 | 𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑛𝑜”) = 1/8 ∗ 3/8 ∗ 2/7 ∗ 3/7 = 0.006
• P(X | Ci) × P(Ci)
• 𝑃(𝐗 | 𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑦𝑒𝑠”) ∗ 𝑃(𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑦𝑒𝑠”) = 0.044
• 𝑃(𝐗| 𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑛𝑜”) ∗ 𝑃(𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑛𝑜”) = 0.002
• P(Ci | X)
• 𝑃(𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑦𝑒𝑠” | 𝐗) = 0.953
• 𝑃(𝑏𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = “𝑛𝑜” | 𝐗) = 0.047
Therefore, 𝐗 belongs to class (“buys_computer = yes”)
72
Handling missing values
• If the values of some attributes are missing, these attributes
are omitted from the product of probabilities
• As a result, the estimation is less accurate
• For example,
age income student credit_rating buys_computer
? medium yes fair ?
73
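A minimal sketch (my own) of the rule above: conditionals for missing attributes are simply left out of the product. The numeric values are illustrative.

```python
# Minimal sketch: skip attributes whose value is missing (None) when forming
# the product of conditional probabilities.
from math import prod

def naive_bayes_score(conditionals, prior):
    """conditionals: P(x_k | Ci) per attribute, or None if x_k is missing."""
    return prod(p for p in conditionals if p is not None) * prior

# Age is missing, so only income, student and credit_rating contribute.
print(naive_bayes_score([None, 4/9, 6/9, 6/9], prior=9/14))
```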
Naïve Bayesian classification: Evaluation
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• The class-conditional independence assumption → loss of accuracy
• In practice, dependencies exist among variables, and these cannot be modeled by naïve Bayes
• E.g., in medical records, a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.) are interdependent
74
Quiz 02: Naïve Bayesian classification
76