Week 4 - Classification Alternative Techniques
Hanoi, 09/2021
Outline
● Rule-Based Classifier
● Naive Bayes Classifier
● K-nearest Neighbor Classifier
● Support Vector Machines
● Artificial Neural Networks (ANN)
2
Rule-Based Classifier
3
Rule-based Classifier (Example)
5
Rule Coverage and Accuracy
● Coverage of a rule:
– Fraction of records that satisfy the antecedent of the rule
● Accuracy of a rule:
– Fraction of records that satisfy the antecedent that also satisfy the consequent of the rule
● Example: (Status=Single) → No
Coverage = 40%, Accuracy = 50%
6
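As a quick illustration, coverage and accuracy of the rule (Status=Single) → No can be computed directly; the 10-record dataset below is hypothetical, chosen only so the counts reproduce the 40% / 50% figures above.

```python
# Coverage/accuracy of the rule (Status=Single) -> No on a small illustrative dataset
records = [
    {"Status": "Single",   "Class": "No"},
    {"Status": "Single",   "Class": "No"},
    {"Status": "Single",   "Class": "Yes"},
    {"Status": "Single",   "Class": "Yes"},
    {"Status": "Married",  "Class": "No"},
    {"Status": "Married",  "Class": "No"},
    {"Status": "Married",  "Class": "No"},
    {"Status": "Divorced", "Class": "No"},
    {"Status": "Divorced", "Class": "Yes"},
    {"Status": "Married",  "Class": "No"},
]

covered = [r for r in records if r["Status"] == "Single"]   # antecedent satisfied
correct = [r for r in covered if r["Class"] == "No"]        # consequent also satisfied

coverage = len(covered) / len(records)   # 4/10 = 0.4
accuracy = len(correct) / len(covered)   # 2/4  = 0.5
print(f"coverage = {coverage:.0%}, accuracy = {accuracy:.0%}")
```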
How Does a Rule-Based Classifier Work?
7
Building Classification Rules
● Direct Method:
◆ Extract rules directly from data
◆ Examples: RIPPER, CN2, Holte’s 1R
● Indirect Method:
◆ Extract rules from other classification models (e.g., decision trees, neural networks)
◆ Examples: C4.5rules
8
Direct Method: Sequential Covering
9
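In outline, sequential covering learns one rule at a time and removes the training records that rule covers before learning the next one. A minimal sketch, where learn_one_rule is a hypothetical helper standing in for a greedy rule grower (as in RIPPER or CN2):

```python
def sequential_covering(records, target_class, learn_one_rule, min_coverage=1):
    """Grow a rule list for target_class one rule at a time (a sketch, not a full RIPPER)."""
    rules = []
    remaining = list(records)
    while any(r["Class"] == target_class for r in remaining):
        rule = learn_one_rule(remaining, target_class)        # hypothetical greedy rule grower
        covered = [r for r in remaining if rule.matches(r)]   # records satisfying the antecedent
        if len(covered) < min_coverage:                       # stop when no useful rule is found
            break
        rules.append(rule)
        remaining = [r for r in remaining if not rule.matches(r)]  # remove covered records
    return rules
```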
Example of Sequential Covering
10
Example of Sequential Covering…
11
Direct Method: RIPPER
13
Rule Evaluation
● FOIL: First Order Inductive Learner – an early rule-based learning algorithm
14
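For reference, the rule-evaluation measure used when growing rules in FOIL (and in RIPPER's rule-growing phase) is FOIL's information gain. If a candidate rule covers p0 positive and n0 negative records before a new conjunct is added, and p1 positive and n1 negative records afterwards:

FOIL's gain = p1 x [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ]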
Indirect Methods
15
Indirect Method: C4.5rules
16
Example
17
C4.5 versus C4.5rules versus RIPPER
C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians
RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
() → Mammals
18
Advantages of Rule-Based Classifiers
19
Bayes Classifier
● Bayes theorem: P(Y | X) = P(X | Y) P(Y) / P(X)
20
Using Bayes Theorem for Classification
21
Using Bayes Theorem for Classification
● Approach:
– Compute the posterior probability P(Y | X1, X2, …, Xd) using Bayes theorem
22
Example Data
Given a Test Record: X = (Refund = No, Divorced, Income = 120K)
• We need to estimate
P(Evade = Yes | X) and P(Evade = No | X)
23
Example Data
Given a Test Record:
24
Conditional Independence
25
Naïve Bayes Classifier
26
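The naïve Bayes assumption is that the attributes are conditionally independent given the class:

P(X1, X2, …, Xd | Y) = P(X1 | Y) x P(X2 | Y) x … x P(Xd | Y)

so the predicted class is the y that maximizes P(y) x P(X1 | y) x … x P(Xd | y); the denominator P(X1, …, Xd) is the same for every class and can be ignored.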
Naïve Bayes on Example Data
Given a Test Record:
P(X | Yes) =
P(Refund = No | Yes) x
P(Divorced | Yes) x
P(Income = 120K | Yes)
P(X | No) =
P(Refund = No | No) x
P(Divorced | No) x
P(Income = 120K | No)
27
Estimate Probabilities from Data
● P(y) = fraction of instances of class y
– e.g., P(No) = 7/10,
P(Yes) = 3/10
29
Estimate Probabilities from Data
● For a continuous attribute Xi, assume a normal distribution:
P(Xi | Y = y) = 1 / sqrt(2π σ²) x exp( −(Xi − μ)² / (2σ²) )
where μ and σ² are the sample mean and sample variance of Xi over the records of class y
30
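A small Python sketch of this estimate, using the sample means and variances that appear in the worked example below (class = No: mean 110, variance 2975; class = Yes: mean 90, variance 25) evaluated at Taxable Income = 120K:

```python
import math

def normal_density(x, mean, variance):
    """Gaussian density used by naive Bayes for a continuous attribute."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# P(Income = 120K | No) and P(Income = 120K | Yes) from the example statistics
print(normal_density(120, mean=110, variance=2975))  # ~0.0072
print(normal_density(120, mean=90, variance=25))     # ~1.2e-09
```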
Example of Naïve Bayes Classifier
Given a Test Record: X = (Refund = No, Divorced, Income = 120K)
31
Naïve Bayes Classifier can make decisions with partial information about the attributes in the test record.

Naïve Bayes Classifier:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No: sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90, sample variance = 25

Even in the absence of information about any attribute, we can use the a priori probabilities of the class variable:
P(Yes) = 3/10, P(No) = 7/10

If we only know that Marital Status is Divorced, then:
P(Yes | Divorced) = 1/3 x 3/10 / P(Divorced)
P(No | Divorced) = 1/7 x 7/10 / P(Divorced)

If we also know that Refund = No, then:
P(Yes | Refund = No, Divorced) = 1 x 1/3 x 3/10 / P(Divorced, Refund = No)
P(No | Refund = No, Divorced) = 4/7 x 1/7 x 7/10 / P(Divorced, Refund = No)

If we also know that Taxable Income = 120, then:
P(Yes | Refund = No, Divorced, Income = 120) = 1.2 x 10^-9 x 1 x 1/3 x 3/10 / P(Divorced, Refund = No, Income = 120)
P(No | Refund = No, Divorced, Income = 120) = 0.0072 x 4/7 x 1/7 x 7/10 / P(Divorced, Refund = No, Income = 120)
32
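The same comparison written out in Python with the conditional probabilities listed above; the shared denominator P(Divorced, Refund = No, Income = 120) is omitted because it does not affect which class wins.

```python
import math

def normal_density(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Class priors from the 10-record training set
p_yes, p_no = 3 / 10, 7 / 10

# Scores for X = (Refund = No, Divorced, Income = 120K), up to the common denominator
score_yes = 1 * (1 / 3) * normal_density(120, 90, 25) * p_yes          # ~1.2e-10
score_no = (4 / 7) * (1 / 7) * normal_density(120, 110, 2975) * p_no   # ~4.1e-04

print("predicted class:", "Yes" if score_yes > score_no else "No")     # -> No
```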
Issues with Naïve Bayes Classifier
Given a Test Record:
X = (Married)
33
Issues with Naïve Bayes Classifier
● Consider the table with Tid = 7 deleted
Naïve Bayes Classifier:
35
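One standard fix for zero conditional probabilities such as P(Refund = Yes | Yes) = 0 above (a single zero factor forces the whole class-conditional product to zero) is Laplace or m-estimate smoothing. A minimal sketch:

```python
def laplace_estimate(count, class_total, num_values):
    """Laplace-smoothed P(Xi = v | y): add 1 to the count and the number of attribute values to the total."""
    return (count + 1) / (class_total + num_values)

def m_estimate(count, class_total, prior, m):
    """m-estimate of P(Xi = v | y) with prior probability `prior` and equivalent sample size m."""
    return (count + m * prior) / (class_total + m)

# Without smoothing, P(Refund = Yes | Yes) = 0/3 = 0; with Laplace smoothing it becomes 1/5
print(laplace_estimate(count=0, class_total=3, num_values=2))  # 0.2
```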
Example of Naïve Bayes Classifier
A: attributes
M: mammals
N: non-mammals
36
Naïve Bayes (Summary)
37
Nearest Neighbor Classifiers
● Basic idea:
– If it walks like a duck, quacks like a duck, then it's probably a duck
[Figure: compute the distance from the test record to the training records]
39
Nearest-Neighbor Classifiers
40
How to Determine the Class Label of a Test Sample?
41
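A minimal k-nearest-neighbor sketch, assuming Euclidean distance and a simple majority vote over the k closest training records (a distance-weighted vote is a common variant):

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Classify test_point by majority vote among its k nearest training records."""
    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# train is a list of (feature_vector, label) pairs; values here are illustrative
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))  # -> "A"
```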
Choice of proximity measure matters
111111111110 vs 011111111111
000000000001 vs 100000000000
42
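A quick check of these two pairs (comparing Euclidean distance with cosine similarity) shows why the choice matters: both pairs are the same Euclidean distance apart, yet the first pair is nearly identical under cosine similarity while the second pair is orthogonal.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

v1 = [int(c) for c in "111111111110"]
v2 = [int(c) for c in "011111111111"]
v3 = [int(c) for c in "000000000001"]
v4 = [int(c) for c in "100000000000"]

print(euclidean(v1, v2), cosine(v1, v2))  # ~1.414, ~0.91
print(euclidean(v3, v4), cosine(v3, v4))  # ~1.414,  0.0
```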
Nearest Neighbor Classification…
43
Nearest Neighbor Classification…
44
Improving KNN Efficiency
45
Support Vector Machines
52
Linear SVM
● Linear model: f(x) = sign(w ∙ x + b)
53
Learning Linear SVM
● Objective is to maximize the margin 2 / ||w||, or equivalently to minimize ||w||^2 / 2
● Subject to: y_i (w ∙ x_i + b) ≥ 1 for every training record (x_i, y_i)
54
Learning Linear SVM
55
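The slides pose this as a constrained (quadratic programming) problem; as an illustrative alternative, the same maximum-margin objective can be approximated by sub-gradient descent on the hinge loss with an L2 penalty. A minimal numpy sketch (learning rate, C, and iteration count are arbitrary choices):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=1000):
    """Hinge-loss + margin-penalty training of a linear SVM; y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                          # record violates the margin
                w -= lr * (w / n - C * y[i] * X[i])
                b += lr * C * y[i]
            else:                                   # only the regularizer contributes
                w -= lr * (w / n)
    return w, b

# Tiny linearly separable example (illustrative data)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # should match y
```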
Support Vector Machines
56
Soft-Margin Support Vector Machines
◆ Minimize ||w||^2 / 2 + C Σ ξ_i
◆ Subject to: y_i (w ∙ x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, where the ξ_i are slack variables and C is the cost of violating the margin
57
Soft-Margin Support Vector Machines
59
Nonlinear Support Vector Machines
Decision boundary: w ∙ Φ(x) + b = 0, where Φ maps the attributes into a higher-dimensional space
60
Learning Nonlinear SVM
● Optimization problem: same as the linear case with x replaced by Φ(x): minimize ||w||^2 / 2 subject to y_i (w ∙ Φ(x_i) + b) ≥ 1. In the dual formulation the training data appear only through the dot products Φ(x_i) ∙ Φ(x_j).
61
Learning Nonlinear SVM
● Issues:
– What type of mapping function Φ should be used?
– How to do the computation in high-dimensional space?
◆ Most computations involve the dot product Φ(x_i) ∙ Φ(x_j)
◆ Curse of dimensionality?
62
Learning Nonlinear SVM
● Kernel Trick:
– Φ(x_i) ∙ Φ(x_j) = K(x_i, x_j)
– K(x_i, x_j) is a kernel function (expressed in terms of the coordinates in the original space)
◆ Examples:
K(x, y) = (x ∙ y + 1)^p
K(x, y) = exp(−||x − y||^2 / (2σ^2))
K(x, y) = tanh(k x ∙ y − δ)
63
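A quick numeric check of the kernel trick for the degree-2 polynomial kernel: K(x, y) = (x ∙ y + 1)^2 equals the dot product Φ(x) ∙ Φ(y) of the explicit feature maps, so the higher-dimensional computation never has to be carried out explicitly. Φ below is the standard expansion for 2-dimensional inputs.

```python
import numpy as np

def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return (np.dot(x, y) + 1) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly_kernel(x, y), np.dot(phi(x), phi(y)))  # the two values agree (both 4.0)
```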
Example of Nonlinear SVM
64
Learning Nonlinear SVM
65
Characteristics of SVM
● The SVM learning problem can be formulated as a convex optimization problem, so a globally optimal solution is found
– Many of the other methods use greedy approaches and find only locally optimal solutions
● Robust to noise
● Overfitting is handled by maximizing the margin of the decision boundary
● SVM can handle irrelevant and redundant attributes better than many other techniques
● The user needs to provide the type of kernel function and cost function
● Difficult to handle missing values
66
Artificial Neural Networks (ANN)
Activation Function
69
Perceptron Example
70
Perceptron Learning Rule
71
Perceptron Learning Rule
72
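A minimal sketch of the classic perceptron learning rule, w ← w + λ (y − ŷ) x, with the bias folded in as an extra weight; the learning rate and the small AND dataset are illustrative choices.

```python
def perceptron_train(X, y, lr=0.1, epochs=20):
    """Train a single perceptron with the rule w <- w + lr * (y - y_hat) * x."""
    w = [0.0] * (len(X[0]) + 1)                 # last weight is the bias
    for _ in range(epochs):
        for xi, target in zip(X, y):
            xi = list(xi) + [1.0]               # append the bias input
            y_hat = 1 if sum(wi * v for wi, v in zip(w, xi)) >= 0 else 0
            w = [wi + lr * (target - y_hat) * v for wi, v in zip(w, xi)]
    return w

# Learn the (linearly separable) AND function
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w = perceptron_train(X, y)
print([1 if sum(wi * v for wi, v in zip(w, list(x) + [1.0])) >= 0 else 0 for x in X])  # [0, 0, 0, 1]
```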
Example of Perceptron Learning
73
Perceptron Learning
● Since y is a linear combination of the input variables, the decision boundary is linear
74
Nonlinearly Separable Data
75
Multi-layer Neural Network
76
Multi-layer Neural Network
77
Why Multiple Hidden Layers?
79
Activation Functions
80
Learning Multi-layer Neural Network
81
Gradient Descent
82
Computing Gradients
83
Backpropagation Algorithm
84
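A minimal numpy sketch of gradient descent with backpropagation for a one-hidden-layer network with sigmoid activations, trained here on XOR; the layer sizes, learning rate, and squared-error loss are illustrative choices rather than the exact setup on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)         # XOR is not linearly separable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 8 units, one output unit
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.5

for _ in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass (squared-error loss); delta terms follow the chain rule
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient-descent updates
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(np.round(out.ravel(), 2))   # should be close to [0, 1, 1, 0] after training
```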
Design Issues in ANN
85
Characteristics of ANN
86
Deep Learning Trends
● Training deep neural networks (more than 5-10 layers) has become feasible only recently, thanks to:
– Faster computing resources (GPU)
– Supervised pre-training