Week 4 - Classification Alternative Techniques

- The document discusses rule-based classifiers, naive Bayes classifiers, and their application to classification problems.
- Rule-based classifiers use "if-then" rules to classify records based on attribute values, while naive Bayes classifiers apply Bayes' theorem to estimate the probability that a record belongs to each class.
- The document provides examples of how to build classification rules, estimate probabilities from data, and apply naive Bayes to classify a test record based on attribute values and class probabilities.

UET (Since 2004)

VNU University of Engineering and Technology (Đại học Công nghệ, ĐHQGHN)

INT3209 - DATA MINING


Week 4: Classification
Alternative Techniques
Duc-Trong Le

Slide credit: Vipin Kumar et al.,


https://www-users.cse.umn.edu/~kumar001/dmbook

Hanoi, 09/2021
Outline

● Rule-Based Classifier
● Naive Bayes Classifier
● K-nearest Neighbor Classifier
● Support Vector Machines
● Artificial Neural Networks (ANN)

2
Rule-Based Classifier

● Classify records by using a collection of


“if…then…” rules
● Rule: (Condition) → y
– where
◆ Condition is a conjunction of tests on attributes
◆ y is the class label
– Examples of classification rules:
◆ (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
◆ (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

3
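As an illustration (not from the slides), one possible way to encode such a rule in Python is as a dictionary of attribute tests plus a class label; the names below (rule, matches, record) are hypothetical:

# A rule is a conjunction of attribute tests plus a predicted class label.
rule = {"condition": {"Blood Type": "Warm", "Lay Eggs": "Yes"}, "label": "Birds"}

def matches(rule, record):
    """Return True if every attribute test in the rule's condition is satisfied by the record."""
    return all(record.get(attr) == value for attr, value in rule["condition"].items())

record = {"Blood Type": "Warm", "Lay Eggs": "Yes", "Can Fly": "Yes"}
if matches(rule, record):
    print("Predicted class:", rule["label"])   # -> Birds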
Rule-based Classifier (Example)

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds


R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) →
Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
4
Application of Rule-Based Classifier

● A rule r covers an instance x if the attributes of


the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

The rule R1 covers a hawk => Bird


The rule R3 covers the grizzly bear => Mammal

5
Rule Coverage and Accuracy

● Coverage of a rule:
– Fraction of records that satisfy
the antecedent of a rule
● Accuracy of a rule:
– Fraction of records that satisfy
the antecedent that also satisfy
the consequent of the rule

Example: (Status=Single) → No
Coverage = 40%, Accuracy = 50%

6
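A minimal Python sketch of how coverage and accuracy could be computed, assuming categorical attributes and records stored as dictionaries with a "class" key (the helper name rule_coverage_accuracy is ours):

def rule_coverage_accuracy(condition, label, records):
    """condition: dict of attribute -> required value; records: list of dicts with a 'class' key."""
    covered = [r for r in records
               if all(r.get(a) == v for a, v in condition.items())]
    coverage = len(covered) / len(records)
    correct = sum(r["class"] == label for r in covered)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# For the rule (Status=Single) -> No on the 10-record tax table used in the slides,
# this yields coverage = 0.4 and accuracy = 0.5.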
How does Rule-based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds


R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

A lemur triggers rule R3, so it is classified as a mammal


A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules

7
Building Classification Rules

● Direct Method:
◆ Extract rules directly from data
◆ Examples: RIPPER, CN2, Holte’s 1R

● Indirect Method:
◆ Extract rules from other classification models
(e.g.
decision trees, neural networks, etc).
◆ Examples: C4.5rules

8
Direct Method: Sequential Covering

1. Start from an empty rule


2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Step (2) and (3) until stopping criterion
is met

9
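A schematic sketch of this loop; learn_one_rule is a hypothetical user-supplied function standing in for the Learn-One-Rule step, and records are assumed to be dictionaries with a "class" key:

def sequential_covering(records, target_class, learn_one_rule):
    """Greedy rule induction. learn_one_rule(records, target_class) is assumed to
    return (condition_dict, label), or (None, None) if no useful rule can be grown."""
    rules = []
    remaining = list(records)
    while any(r["class"] == target_class for r in remaining):
        condition, label = learn_one_rule(remaining, target_class)   # Step 2: grow one rule
        if condition is None:                                        # stopping criterion
            break
        rules.append((condition, label))
        # Step 3: remove training records covered by the new rule
        remaining = [r for r in remaining
                     if not all(r.get(a) == v for a, v in condition.items())]
    return rules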
Example of Sequential Covering

10
Example of Sequential Covering…

11
Direct Method: RIPPER

● Strategy: Apply sequential covering with pruning


● For 2-class problem, choose one of the classes as
positive class, and the other as negative class
– Learn rules for positive class
– Negative class will be default class
● For multi-class problem
– Order the classes according to increasing class
prevalence (fraction of instances that belong to a
particular class)
– Learn the rule set for smallest class first, treat the
rest as negative class
– Repeat with next smallest class as positive class
12
Rule Growing

● Two common strategies

13
Rule Evaluation
● FOIL (First Order Inductive Learner) – an early
rule-based learning algorithm

14
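FOIL's information gain compares the positive/negative coverage of a rule before and after a conjunct is added; a small sketch using the standard form of the measure (the function name foil_gain is ours):

import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain.
    p0, n0: positive/negative records covered by the rule before adding a conjunct (p0 > 0 assumed)
    p1, n1: positive/negative records covered after adding the conjunct
    """
    if p1 == 0:
        return 0.0
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))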
Indirect Methods

15
Indirect Method: C4.5rules

● Extract rules from an unpruned decision tree


● For each rule, r: A → y,
– consider an alternative rule r′: A′ → y where
A′ is obtained by removing one of the
conjuncts in A
– Compare the pessimistic error rate for r
against all the alternative rules r′
– Prune if one of the alternative rules has lower
pessimistic error rate
– Repeat until we can no longer improve
generalization error

16
Example

17
C4.5 versus C4.5rules versus RIPPER
C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No)
→ Reptiles
(Can Fly=Yes,Give Birth=No) → Birds
() → Mammals

18
Advantages of Rule-Based Classifiers

● Has characteristics quite similar to decision trees


– As highly expressive as decision trees
– Easy to interpret (if rules are ordered by
class)
– Performance comparable to decision trees
◆ Can handle redundant and irrelevant attributes
◆ Variable interactions can cause issues (e.g., the XOR problem)
● Better suited for handling imbalanced classes
● Harder to handle missing values in the test set

19
Bayes Classifier

● A probabilistic framework for solving


classification problems
● Conditional Probability: P(Y | X) = P(X, Y) / P(X)

● Bayes theorem: P(Y | X) = P(X | Y) P(Y) / P(X)

20
Using Bayes Theorem for Classification

● Consider each attribute and class


label as random variables
● Given a record with attributes (X1,
X2,…, Xd), the goal is to predict
class Y

– Specifically, we want to find the value


of Y that maximizes P(Y| X1, X2,…, Xd )

● Can we estimate P(Y | X1, X2, …, Xd) directly from data?

21
Using Bayes Theorem for Classification

● Approach:
– compute posterior probability P(Y | X1, X2, …, Xd) using
the Bayes theorem

– Maximum a-posteriori: Choose Y that maximizes


P(Y | X1, X2, …, Xd)

– Equivalent to choosing value of Y that maximizes


P(X1, X2, …, Xd|Y) P(Y)

● How to estimate P(X1, X2, …, Xd | Y )?

22
Example Data
Given a Test Record:
X = (Refund = No, Marital Status = Divorced, Income = 120K)

• We need to estimate
P(Evade = Yes | X) and P(Evade = No | X)

In the following we will replace


Evade = Yes by Yes, and
Evade = No by No

23
Example Data
Given a Test Record:

24
Conditional Independence

● X and Y are conditionally independent given Z if


P(X|YZ) = P(X|Z)

● Example: Arm length and reading skills


– Young child has shorter arm length and
limited reading skills, compared to adults
– If age is fixed, no apparent relationship
between arm length and reading skills
– Arm length and reading skills are
conditionally independent given age

25
Naïve Bayes Classifier

● Assume independence among attributes Xi when class


is given:
– P(X1, X2, …, Xd |Yj) = P(X1| Yj) P(X2| Yj)… P(Xd| Yj)

– Now we can estimate P(Xi| Yj) for all Xi and Yj


combinations from the training data

– New point is classified to Yj if P(Yj) Π P(Xi| Yj) is


maximal.

26
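A minimal sketch of this decision rule, assuming class priors and per-attribute conditional probabilities have already been estimated and stored in dictionaries (the names priors and cond_probs are hypothetical):

def naive_bayes_predict(record, priors, cond_probs):
    """Pick the class y that maximizes P(y) * prod_i P(Xi | y).
    priors: {class: P(class)}; cond_probs: {class: {(attribute, value): P(value | class)}}"""
    best_class, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for attr, value in record.items():
            # conditional independence assumption: multiply per-attribute conditionals
            score *= cond_probs[y].get((attr, value), 0.0)
        if score > best_score:
            best_class, best_score = y, score
    return best_class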
Naïve Bayes on Example Data
Given a Test Record: X = (Refund = No, Divorced, Income = 120K)

P(X | Yes) =
P(Refund = No | Yes) x
P(Divorced | Yes) x
P(Income = 120K | Yes)

P(X | No) =
P(Refund = No | No) x
P(Divorced | No) x
P(Income = 120K | No)

27
Estimate Probabilities from Data
● P(y) = fraction of instances of class y
– e.g., P(No) = 7/10,
P(Yes) = 3/10

● For categorical attributes:


P(Xi = c | y) = nc / n
– where nc is the number of instances having attribute
value Xi = c and belonging to class y, and n is the
number of instances of class y
– Examples:
P(Status=Married|No) = 4/7
P(Refund=Yes|Yes)=0
28
Estimate Probabilities from Data

● For continuous attributes:


– Discretization: Partition the range into bins:
◆ Replace continuous value with bin value
– Attribute changed from continuous to ordinal

– Probability density estimation:


◆ Assume attribute follows a normal distribution
◆ Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
◆ Once probability distribution is known, use it to
estimate the conditional probability P(Xi|Y)

29
Estimate Probabilities from Data

● Normal distribution:
P(Xi = xi | Yj) = (1 / √(2π σij²)) · exp(−(xi − μij)² / (2 σij²))
– One distribution for each (Xi, Yj) pair

● For (Income, Class=No):
◆ sample mean = 110
◆ sample variance = 2975

30
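As a quick check (a sketch, not part of the slides), plugging Income = 120K into a normal density with mean 110 and variance 2975 reproduces the value 0.0072 used on the next slide:

import math

def gaussian_density(x, mean, variance):
    """Normal probability density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

print(gaussian_density(120, 110, 2975))   # ~0.0072, i.e., P(Income = 120K | No)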
Example of Naïve Bayes Classifier
Given a Test Record: X = (Refund = No, Divorced, Income = 120K)

Naïve Bayes Classifier:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No: sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90, sample variance = 25

• P(X | No) = P(Refund=No | No) × P(Divorced | No) × P(Income=120K | No)
= 4/7 × 1/7 × 0.0072 = 0.0006
• P(X | Yes) = P(Refund=No | Yes) × P(Divorced | Yes) × P(Income=120K | Yes)
= 1 × 1/3 × 1.2 × 10⁻⁹ = 4 × 10⁻¹⁰

Since P(X | No) P(No) > P(X | Yes) P(Yes),
P(No | X) > P(Yes | X)
=> Class = No

31
Naïve Bayes Classifier can make decisions with partial information about attributes in the test record

Even in absence of information about any attributes, we can use the
a priori probabilities of the class variable:
P(Yes) = 3/10
P(No) = 7/10

Naïve Bayes Classifier:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No: sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90, sample variance = 25

If we only know that Marital Status = Divorced, then:
P(Yes | Divorced) = 1/3 × 3/10 / P(Divorced)
P(No | Divorced) = 1/7 × 7/10 / P(Divorced)

If we also know that Refund = No, then:
P(Yes | Refund = No, Divorced) = 1 × 1/3 × 3/10 / P(Divorced, Refund = No)
P(No | Refund = No, Divorced) = 4/7 × 1/7 × 7/10 / P(Divorced, Refund = No)

If we also know that Taxable Income = 120K, then:
P(Yes | Refund = No, Divorced, Income = 120K) = 1.2 × 10⁻⁹ × 1 × 1/3 × 3/10 / P(Divorced, Refund = No, Income = 120K)
P(No | Refund = No, Divorced, Income = 120K) = 0.0072 × 4/7 × 1/7 × 7/10 / P(Divorced, Refund = No, Income = 120K)
32
Issues with Naïve Bayes Classifier
Given a Test Record: X = (Married)

Naïve Bayes Classifier:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No: sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90, sample variance = 25

P(Yes) = 3/10
P(No) = 7/10

P(Yes | Married) = 0 × 3/10 / P(Married)
P(No | Married) = 4/7 × 7/10 / P(Married)

33
Issues with Naïve Bayes Classifier
Consider the table with Tid = 7 deleted

Naïve Bayes Classifier:
P(Refund = Yes | No) = 2/6
P(Refund = No | No) = 4/6
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/6
P(Marital Status = Divorced | No) = 0
P(Marital Status = Married | No) = 4/6
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0/3
For Taxable Income:
If class = No: sample mean = 91, sample variance = 685
If class = Yes: sample mean = 90, sample variance = 25

Given X = (Refund = Yes, Divorced, 120K)
P(X | No) = 2/6 × 0 × 0.0083 = 0
P(X | Yes) = 0 × 1/3 × 1.2 × 10⁻⁹ = 0

Naïve Bayes will not be able to classify X as Yes or No!
34
Issues with Naïve Bayes Classifier

● If one of the conditional probabilities is zero,


then the entire expression becomes zero
● Need to use other estimates of conditional probabilities
than simple fractions
● Probability estimation:
– Original: P(Xi = c | y) = nc / n
– Laplace correction: P(Xi = c | y) = (nc + 1) / (n + v)
– m-estimate: P(Xi = c | y) = (nc + m·p) / (n + m)
where
n: number of training instances belonging to class y
nc: number of instances with Xi = c and Y = y
v: total number of attribute values that Xi can take
p: initial estimate of P(Xi = c | y), known a priori
m: hyper-parameter for our confidence in p

35
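A small sketch of these smoothed estimates; for example, with record Tid = 7 deleted as in the previous slide, the Laplace estimate of P(Status = Divorced | No) becomes (0 + 1) / (6 + 3) ≈ 0.11 instead of 0:

def laplace_estimate(nc, n, v):
    """Laplace correction: (nc + 1) / (n + v)."""
    return (nc + 1) / (n + v)

def m_estimate(nc, n, p, m):
    """m-estimate: (nc + m*p) / (n + m), where p is a prior estimate and m a confidence weight."""
    return (nc + m * p) / (n + m)

print(laplace_estimate(0, 6, 3))   # P(Divorced | No) with the Laplace correction -> 0.111...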
Example of Naïve Bayes Classifier

A: attributes
M: mammals
N: non-mammals

P(A|M)P(M) > P(A|N)P(N)


=> Mammals

36
Naïve Bayes (Summary)

● Robust to isolated noise points

● Handle missing values by ignoring the instance


during probability estimate calculations

● Robust to irrelevant attributes

● Redundant and correlated attributes will violate


class conditional assumption
– Use other techniques such as Bayesian Belief
Networks (BBN)

37
Naïve Bayes

● How does Naïve Bayes perform on the following dataset?

Conditional independence of attributes is violated


38
Nearest Neighbor Classifiers

● Basic idea:
– If it walks like a duck, quacks like a duck,
then it’s probably a duck

(Figure: compute the distance from the test record to the training records, then choose the k "nearest" records.)

39
Nearest-Neighbor Classifiers

● Requires the following:


– A set of labeled records
– Proximity metric to compute
distance/similarity between a
pair of records
– e.g., Euclidean distance
– The value of k, the number of
nearest neighbors to retrieve
– A method for using class
labels of K nearest neighbors
to determine the class label of
unknown record (e.g., by
taking majority vote)

40
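A minimal k-nearest-neighbor sketch, assuming numeric attribute vectors, Euclidean distance, and a majority vote (the function names are ours):

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(test_point, training_records, k=3):
    """training_records: list of (attribute_vector, class_label) pairs."""
    neighbors = sorted(training_records, key=lambda rec: euclidean(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]   # majority vote among the k nearest records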
How to Determine the class label of a Test Sample?

41
Choice of proximity measure matters

● For documents, cosine is better than correlation or


Euclidean

Pair 1: 111111111110 vs 011111111111
Pair 2: 000000000001 vs 100000000000

Euclidean distance = 1.4142 for both pairs, but


the cosine similarity measure has different
values for these pairs.

42
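A quick check of this claim on the two pairs of 12-bit document vectors from the slide (a pure-Python sketch):

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

d1 = [int(c) for c in "111111111110"]; d2 = [int(c) for c in "011111111111"]
d3 = [int(c) for c in "000000000001"]; d4 = [int(c) for c in "100000000000"]
print(euclidean(d1, d2), euclidean(d3, d4))   # both 1.4142...
print(cosine(d1, d2), cosine(d3, d4))         # ~0.909 vs 0.0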
Nearest Neighbor Classification…

● Data preprocessing is often required


– Attributes may have to be scaled to prevent distance
measures from being dominated by one of the
attributes
◆ Example:
– height of a person may vary from 1.5m to 1.8m
– weight of a person may vary from 90lb to 300lb
– income of a person may vary from $10K to $1M

– Time series are often standardized to have zero mean
and a standard deviation of 1

43
Nearest Neighbor Classification…

● Choosing the value of k:


– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points
from other classes

44
Improving KNN Efficiency

● Avoid having to compute distance to all objects


in the training set
– Multi-dimensional access methods (k-d trees)
– Fast approximate similarity search
– Locality Sensitive Hashing (LSH)
● Condensing
– Determine a smaller set of objects that give
the same performance
● Editing
– Remove objects to improve efficiency

45
Support Vector Machines

● Find a linear hyperplane (decision boundary) that will separate the data

46
Support Vector Machines

● One Possible Solution


47
Support Vector Machines

● Another possible solution


48
Support Vector Machines

● Other possible solutions


49
Support Vector Machines

● Which one is better? B1 or B2?


● How do you define better?
50
Support Vector Machines

● Find the hyperplane that maximizes the margin => B1 is better than B2


51
Support Vector Machines

52
Linear SVM

● Linear model:

● Learning the model is equivalent to determining


the values of
– How to find from training data?

53
Learning Linear SVM

● Objective is to maximize the margin: 2 / ‖w‖

– Which is equivalent to minimizing: L(w) = ‖w‖² / 2

– Subject to the following constraints:
w·xi + b ≥ 1 if yi = 1, and w·xi + b ≤ −1 if yi = −1
or, equivalently, yi (w·xi + b) ≥ 1

◆ This is a constrained optimization problem


– Solve it using Lagrange multiplier method

54
Learning Linear SVM

● Decision boundary depends only on support


vectors
– If you have a data set with the same support
vectors, the decision boundary will not change

– How to classify using SVM once w and b are


found? Given a test record, xi

55
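A minimal sketch of the resulting decision rule, assuming class labels coded as +1 / −1:

def svm_predict(w, b, x):
    """Classify x as +1 if w . x + b >= 0, else -1 (w, b learned from training data)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1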
Support Vector Machines

● What if the problem is not linearly separable?

56
Soft-Margin Support Vector Machines

● What if the problem is not linearly separable?


– Introduce slack variables
◆ Need to minimize:

◆ Subject to: yi (w·xi + b) ≥ 1 − ξi, where ξi ≥ 0

◆ If k is 1 or 2, this leads to a similar objective
function as the linear SVM but with different
constraints (see textbook)

57
Soft-Margin Support Vector Machines

● Find the hyperplane that optimizes both factors


58
Nonlinear Support Vector Machines

● What if decision boundary is not linear?

59
Nonlinear Support Vector Machines

● Transform data into higher dimensional space

Decision boundary: w·Φ(x) + b = 0

60
Learning Nonlinear SVM

● Optimization problem:

● Which leads to the same set of equations (but


involve Φ(x) instead of x)

61
Learning NonLinear SVM

● Issues:
– What type of mapping function Φ should be
used?
– How to do the computation in high
dimensional space?
◆ Most computations involve dot product Φ(xi)∙Φ(xj)
◆ Curse of dimensionality?

62
Learning Nonlinear SVM

● Kernel Trick:
– Φ(xi)∙ Φ(xj) = K(xi, xj)
– K(xi, xj) is a kernel function (expressed in
terms of the coordinates in the original space)
◆ Examples:

63
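Two commonly used kernels, likely the examples intended here (the formulas did not survive extraction); the parameter names degree, c, and sigma are ours:

import math

def polynomial_kernel(x, y, degree=2, c=1.0):
    """K(x, y) = (x . y + c)^degree"""
    return (sum(xi * yi for xi, yi in zip(x, y)) + c) ** degree

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))"""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))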
Example of Nonlinear SVM

SVM with polynomial degree-2 kernel

64
Learning Nonlinear SVM

● Advantages of using kernel:


– Don’t have to know the mapping function Φ
– Computing dot product Φ(xi)∙ Φ(xj) in the
original space avoids curse of dimensionality

● Not all functions can be kernels


– Must make sure there is a corresponding Φ in
some high-dimensional space
– Mercer’s theorem (see textbook)

65
Characteristics of SVM

● The learning problem is formulated as a convex optimization problem


– Efficient algorithms are available to find the global minima

– Many of the other methods use greedy approaches and find locally
optimal solutions

– High computational complexity for building the model

● Robust to noise
● Overfitting is handled by maximizing the margin of the decision boundary
● SVM can handle irrelevant and redundant attributes better than many other
techniques
● The user needs to provide the type of kernel function and cost function
● Difficult to handle missing values

66
Artificial Neural Networks (ANN)

● Basic Idea: A complex non-linear function can be


learned as a composition of simple processing units
● ANN is a collection of simple processing units
(nodes) that are connected by directed links (edges)
– Every node receives signals from incoming edges, performs
computations, and transmits signals to outgoing edges
– Analogous to human brain where nodes are neurons and
signals are electrical impulses
– Weight of an edge determines the strength of connection
between the nodes

● Simplest ANN: Perceptron (single neuron)


67
Basic Architecture of Perceptron

Activation Function

● Learns linear decision boundaries


● Related to logistic regression (activation function is sign
instead of sigmoid)
68
Perceptron Example

Output Y is 1 if at least two of the three inputs are equal to 1.

69
Perceptron Example

70
Perceptron Learning Rule

71
Perceptron Learning Rule

72
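The equations on this slide did not survive conversion; as a hedged sketch, the standard perceptron update w ← w + λ(y − ŷ)x applied over several epochs looks like this (labels assumed to be +1 / −1, sign activation):

def perceptron_train(data, learning_rate=0.1, epochs=10):
    """data: list of (x_vector, label) with label in {+1, -1}. Returns weights and bias."""
    n = len(data[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
            if y_hat != y:   # the update w <- w + lr*(y - y_hat)*x is zero when y == y_hat
                for i in range(n):
                    w[i] += learning_rate * (y - y_hat) * x[i]
                b += learning_rate * (y - y_hat)
    return w, b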
Example of Perceptron Learning

(Figures: weight updates over the first epoch, and weight updates over all epochs.)

73
Perceptron Learning

● Since y is a linear combination of the input variables,
the decision boundary is linear

74
Nonlinearly Separable Data

For nonlinearly separable problems, the perceptron learning
algorithm will fail because no linear hyperplane can
separate the data perfectly
XOR Data

75
Multi-layer Neural Network

● More than one hidden layer of


computing nodes
● Every node in a hidden layer
operates on activations from
preceding layer and transmits
activations forward to nodes
of next layer
● Also referred to as
“feedforward neural networks”

76
Multi-layer Neural Network

● Multi-layer neural networks with at least one


hidden layer can solve any type of classification
task involving nonlinear decision surfaces
XOR Data

77
Why Multiple Hidden Layers?

● Activations at hidden layers can be viewed as features


extracted as functions of inputs
● Every hidden layer represents a level of abstraction
– Complex features are compositions of simpler features

● Number of layers is known as depth of ANN


– Deeper networks express complex hierarchy of
features
78
Multi-Layer Network Architecture

(Figure: the activation value at node i of layer l is obtained by applying an activation function to a linear predictor over the previous layer's activations.)

79
Activation Functions

80
Learning Multi-layer Neural Network

81
Gradient Descent

● Loss Function to measure errors across all training


points
Squared Loss: Loss(yi, ŷi) = (yi − ŷi)²

● Gradient descent: Update parameters in the direction of


“maximum descent” in the loss function across all points

● Stochastic gradient descent (SGD): update the weights for every
instance (mini-batch SGD: update over mini-batches of instances)

82
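A schematic of a single stochastic-gradient-descent update for a linear unit under squared loss (a sketch; the slides themselves use backpropagation for multi-layer networks):

def sgd_step(w, b, x, y, learning_rate=0.01):
    """One SGD update for a linear model y_hat = w . x + b under squared loss 0.5*(y - y_hat)^2."""
    y_hat = sum(wi * xi for wi, xi in zip(w, x)) + b
    error = y_hat - y                                                  # dLoss/dy_hat
    w = [wi - learning_rate * error * xi for wi, xi in zip(w, x)]      # gradient step on each weight
    b = b - learning_rate * error                                      # gradient step on the bias
    return w, b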
Computing Gradients

83
Backpropagation Algorithm

84
Design Issues in ANN

● Number of nodes in input layer


– One input node per binary/continuous attribute
– k or log2 k nodes for each categorical attribute with k values
● Number of nodes in output layer
– One output for binary class problem
– k or log2 k nodes for k-class problem
● Number of hidden layers and nodes per layer
● Initial weights and biases
● Learning rate, max. number of epochs, mini-batch size
for mini-batch SGD, …

85
Characteristics of ANN

● Multilayer ANNs are universal approximators but can
suffer from overfitting if the network is too large
– Naturally represents a hierarchy of features at multiple levels of
abstractions
● Gradient descent may converge to local minimum
● Model building is compute intensive, but testing is fast
● Can handle redundant and irrelevant attributes because
weights are automatically learnt for all attributes
● Sensitive to noise in training data
– This issue can be addressed by incorporating model complexity
in the loss function
● Difficult to handle missing attributes

86
Deep Learning Trends

● Training deep neural networks (more than 5-10 layers) has only
become possible in recent times with:
– Faster computing resources (GPU)

– Larger labeled training sets


● Algorithmic Improvements in Deep Learning
– Responsive activation functions (e.g., ReLU)

– Regularization (e.g., Dropout)

– Supervised pre-training

– Unsupervised pre-training (auto-encoders)


● Specialized ANN Architectures:
– Convolutional Neural Networks (for image data)

– Recurrent Neural Networks (for sequence data)


– Residual Networks (with skip connections)

87
