DSTBD_10-DMClassification-ENG
Data Base and Data Mining Group of Politecnico di Torino
Elena Baralis, Politecnico di Torino
Classification
◼ Objectives
◼ prediction of a class label
◼ definition of an interpretable model of a given
phenomenon
Classification
◼ Applications
◼ detection of customers likely to leave a company (churn or attrition)
◼ fraud detection
◼ classification of different pathology types
◼ …
Classification: definition
◼ Given
◼ a collection of class labels
◼ a collection of data objects labelled with a
class label
◼ Find a descriptive profile of each class,
which will allow the assignment of
unlabeled objects to the appropriate class
Definitions
◼ Training set
◼ Collection of labeled data objects used to learn
the classification model
◼ Test set
◼ Collection of labeled data objects used to
validate the classification model
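As a minimal illustration of how the two sets are used (a sketch assuming scikit-learn; the synthetic dataset and parameter values are illustrative, not from the slides):

```python
# Sketch: learn a classification model on a training set and validate it on a test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled data objects: feature matrix X and class labels y
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

# Hold out part of the labeled data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)            # learn the model on the training set
y_pred = model.predict(X_test)         # predict class labels for the test set
print("Test accuracy:", accuracy_score(y_test, y_pred))
```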
Classification techniques
◼ Decision trees
◼ Classification rules
◼ Association rules
◼ Neural Networks
◼ Naïve Bayes and Bayesian Networks
◼ k-Nearest Neighbours (k-NN)
◼ Support Vector Machines (SVM)
◼ …
Evaluation of classification techniques
◼ Accuracy
  ◼ quality of the prediction
◼ Efficiency
  ◼ model building time
  ◼ classification time
◼ Interpretability
  ◼ interpretability of the model and of its predictions
Decision trees
Example of decision tree
From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
Another example of decision tree
A different tree fits the same training data: the root splits on Marital Status (Married vs. Single, Divorced); the Married branch is a leaf labeled NO, while the Single/Divorced branch splits on Refund (Yes: NO) and then on Taxable Income (< 80K: NO, > 80K: YES).
Training data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
There could be more than one tree that fits the same data!
Apply Model to Test Data
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branch matching the record at each node: Refund = No leads to the Marital Status node, and Marital Status = Married leads to the leaf labeled NO.
Assign Cheat to "No"
Decision tree induction
◼ Many algorithms to build a decision tree
◼ Hunt’s Algorithm (one of the earliest)
◼ CART
◼ ID3, C4.5, C5.0
◼ SLIQ, SPRINT
General structure of Hunt's algorithm
Let Dt be the set of training records that reach a node t.
Basic steps
◼ If Dt contains records that belong to the same class yt
  ◼ then t is a leaf node labeled as yt
◼ If Dt is an empty set
  ◼ then t is a leaf node labeled as the default (majority) class, yd
◼ If Dt contains records that belong to more than one class
  ◼ select the "best" attribute A on which to split Dt and label node t as A
  ◼ split Dt into smaller subsets and recursively apply the procedure to each subset
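A compact sketch of this recursive scheme, deliberately simplified (binary equality splits on categorical attributes, chosen by weighted Gini; all names and the tiny dataset are illustrative, not the original formulation):

```python
# Simplified sketch of Hunt-style recursive tree growing (illustrative only).
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0

def hunt(records, labels, default):
    if not records:                       # Dt empty: leaf with default (majority) class yd
        return ("leaf", default)
    if len(set(labels)) == 1:             # all records in the same class yt: leaf labeled yt
        return ("leaf", labels[0])
    majority = Counter(labels).most_common(1)[0][0]
    best = None                           # locally select the test with the lowest weighted Gini
    for a in records[0]:
        for v in {r[a] for r in records}:
            left = [i for i, r in enumerate(records) if r[a] == v]
            right = [i for i in range(len(records)) if i not in set(left)]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(records)
            if best is None or score < best[0]:
                best = (score, a, v, left, right)
    if best is None:                      # no useful split: leaf with the majority class
        return ("leaf", majority)
    _, a, v, left, right = best
    return ("node", a, v,
            hunt([records[i] for i in left],  [labels[i] for i in left],  majority),
            hunt([records[i] for i in right], [labels[i] for i in right], majority))

# Tiny illustrative dataset (attribute names inspired by the slides; labels are illustrative)
data = [{"Refund": "Yes", "MarSt": "Single"}, {"Refund": "No", "MarSt": "Married"},
        {"Refund": "No", "MarSt": "Divorced"}, {"Refund": "No", "MarSt": "Single"}]
cheat = ["No", "No", "Yes", "Yes"]
print(hunt(data, cheat, default="No"))
```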
Hunt's algorithm: example
(figure: step-by-step growth of the tree on the example training data, with successive splits on Refund, Marital Status, and Taxable Income)
Decision tree induction
◼ Adopts a greedy strategy
◼ “Best” attribute for the split is selected locally at
each step
◼ not a global optimum
◼ Issues
◼ Structure of test condition
◼ Binary split versus multiway split
◼ Selection of the best attribute for the split
◼ Stopping condition for the algorithm
Structure of test condition
◼ Depends on attribute type
◼ nominal
◼ ordinal
◼ continuous
◼ Depends on number of outgoing edges
◼ 2-way split
◼ multi-way split
Splitting on nominal attributes
◼ Multi-way split
◼ use as many partitions as distinct values
  ◼ e.g. CarType: Family, Sports, Luxury
◼ Binary split
  ◼ divides values into two subsets
  ◼ e.g. CarType: {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}
Splitting on ordinal attributes
◼ Multi-way split
◼ use as many partitions as distinct values
  ◼ e.g. Size: Small, Medium, Large
◼ Binary split
  ◼ divides values into two subsets, e.g. Size: {Small, Medium} vs. {Large}
  ◼ What about the split {Small, Large} vs. {Medium}? It does not preserve the order of the attribute values.
Splitting on continuous attributes
◼ Different techniques
◼ Discretization to form an ordinal categorical
attribute
◼ Static – discretize once at the beginning
◼ Dynamic – discretize during tree induction
Ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), or
clustering
◼ Binary decision (A < v) or (A ≥ v)
◼ consider all possible splits and find the best cut
◼ more computationally intensive
Splitting on continuous attributes
(figure: a multi-way split on discretized Taxable Income ranges, e.g. < 10K, ..., > 80K, vs. a binary split on Taxable Income > 80K? with Yes/No branches)
Selection of the best attribute
Before splitting: 10 records of class 0, 10 records of class 1
(figure: three candidate splits, on Own Car?, Car Type?, and Student ID?)
Selection of the best attribute
◼ Attributes with homogeneous class
distribution are preferred
◼ Need a measure of node impurity
(example: a node with class counts C0: 5, C1: 5 is non-homogeneous and has high impurity; a node with C0: 9, C1: 1 is more homogeneous and has low impurity)
Measures of node impurity
◼ Many different measures available
◼ Gini index
◼ Entropy
◼ Misclassification error
◼ Different algorithms rely on different
measures
How to find the best attribute
◼ Before splitting, compute the impurity M0 of the parent node (class counts C0: N00, C1: N01)
◼ For each candidate attribute (e.g. A and B, each with a Yes/No split), compute the impurity of the child nodes (M1 and M2 for A, M3 and M4 for B) and combine them into the weighted impurity of the split (M12 for A, M34 for B)
◼ Gain = M0 – M12 vs. M0 – M34: choose the attribute whose split yields the highest gain
GINI impurity measure
◼ Gini index for a given node t

  $GINI(t) = 1 - \sum_j [p(j|t)]^2$

  where p(j|t) is the relative frequency of class j at node t
Examples for computing GINI
(figure: example nodes with different class count distributions and the corresponding Gini values computed with the formula above)
Splitting based on GINI
◼ Used in CART, SLIQ, SPRINT
◼ When a node p is split into k partitions (children), the
quality of the split is computed as
  $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)$

  where
  ni = number of records at child i
  n = number of records at node p
Computing GINI index: Boolean attribute
◼ Splits into two partitions
◼ larger and purer partitions are sought for

Example: parent node with C1 = 6, C2 = 6, Gini = 0.500, split on attribute B (Yes/No):
  Node N1 (Yes): C1 = 5, C2 = 2
  Node N2 (No):  C1 = 1, C2 = 4
Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408
Gini(N2) = 1 – (1/5)² – (4/5)² = 0.32
Gini(split on B) = 7/12 · 0.408 + 5/12 · 0.32 = 0.371
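The numbers above can be reproduced with a few lines of Python (a sketch; the helper name is illustrative):

```python
# Reproduce the Gini computation for the split on B.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = gini([6, 6])                      # 0.500
g_n1 = gini([5, 2])                        # ~0.408
g_n2 = gini([1, 4])                        # 0.320
g_split = 7 / 12 * g_n1 + 5 / 12 * g_n2    # ~0.371, children weighted by size
print(round(parent, 3), round(g_n1, 3), round(g_n2, 3), round(g_split, 3))
```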
Computing GINI index: Categorical attribute
◼ For each distinct value, gather counts for each class in the
dataset
◼ Use the count matrix to make decisions
Computing GINI index: Continuous attribute
◼ Based on a binary decision on one splitting value v
  ◼ number of possible splitting values = number of distinct values of the attribute
◼ Each splitting value v has an associated count matrix
  ◼ class counts in the two partitions
    ◼ A < v
    ◼ A ≥ v
(example: the binary split Taxable Income > 80K? on the training data above)
Computing GINI index: Continuous attribute
◼ For each attribute
◼ Sort the attribute on values
◼ Linearly scan these values, each time updating the count matrix
and computing gini index
◼ Choose the split position that has the least gini index
(example: class counts and Gini values computed at every candidate split position on the sorted Taxable Income values; the best split is the one with the lowest Gini, 0.300 — see the sketch below)
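A sketch of the sort-and-scan procedure for one continuous attribute (it assumes the Taxable Income values and Cheat labels from the example table; the helper names are illustrative):

```python
# Sort the attribute values and evaluate the weighted Gini at each candidate cut point.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Taxable Income (K)
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

pairs = sorted(zip(income, cheat))
best = None
for i in range(1, len(pairs)):
    v = (pairs[i - 1][0] + pairs[i][0]) / 2              # candidate split value (midpoint)
    left  = [c for x, c in pairs if x < v]
    right = [c for x, c in pairs if x >= v]
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
    if best is None or score < best[0]:
        best = (score, v)
print("best split: Taxable Income <", best[1], "with Gini", round(best[0], 3))
```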
Entropy impurity measure (INFO)
◼ Entropy at a given node t

  $Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)$

  where p(j|t) is the relative frequency of class j at node t
Splitting Based on INFO
◼ Information Gain

  $GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$

  where the parent node p is split into k partitions and ni is the number of records in partition i
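For comparison, the same split evaluation with entropy and information gain (a sketch; it reuses the class counts of the Boolean-split example above):

```python
# Entropy and information gain for the earlier split on B (sketch).
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

parent = entropy([6, 6])                                    # 1.0 bit
children = [[5, 2], [1, 4]]                                 # class counts in N1 and N2
n = sum(sum(c) for c in children)
weighted = sum(sum(c) / n * entropy(c) for c in children)
gain = parent - weighted                                    # GAIN_split
print(round(parent, 3), round(weighted, 3), round(gain, 3))
```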
Stopping Criteria for Tree Induction
◼ Stop expanding a node when all the records
belong to the same class
◼ Early termination
◼ Pre-pruning
◼ Post-pruning
Underfitting and Overfitting
(figure: training and test error as a function of model complexity, i.e. the number of tree nodes)
◼ Underfitting: when the model is too simple, both training and test errors are large
◼ Overfitting: when the model is too complex, training error keeps decreasing while test error increases
Overfitting due to Noise
How to address overfitting
◼ Pre-Pruning (Early Stopping Rule)
◼ Stop the algorithm before it becomes a fully-grown tree
How to address overfitting
◼ Post-pruning
◼ Grow decision tree to its entirety
◼ Trim the nodes of the decision tree in a bottom-
up fashion
◼ If generalization error improves after trimming,
replace sub-tree by a leaf node.
◼ Class label of leaf node is determined from
majority class of instances in the sub-tree
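One widely used form of post-pruning, minimal cost-complexity pruning, is exposed by scikit-learn through the ccp_alpha parameter. A hedged sketch (not necessarily the pruning scheme the slide refers to; dataset and parameters are illustrative):

```python
# Sketch of post-pruning via cost-complexity pruning in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow the full tree, then evaluate increasingly pruned versions on held-out data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    score = tree.score(X_val, y_val)     # generalization estimate on validation data
    if score > best_score:
        best_alpha, best_score = alpha, score
print("best ccp_alpha:", best_alpha, "validation accuracy:", round(best_score, 3))
```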
Data fragmentation
◼ Number of instances gets smaller as you
traverse down the tree
Handling missing attribute values
◼ Missing values affect decision tree
construction in three different ways
◼ Affect how impurity measures are computed
◼ Affect how to distribute instances with missing
value to child nodes
◼ Affect how a test instance with missing value is
classified
Decision boundary
(figure: two-dimensional data with attributes x and y, partitioned by a decision tree whose root test is x < 0.43?; each rectangular region contains records of a single class)
• The border line between two neighboring regions of different classes is known as decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
Oblique decision trees
(figure: a single oblique test condition, x + y < 1, separates Class = + from Class = –)
◼ Test conditions may involve more than one attribute at a time
Evaluation of decision trees
◼ Accuracy
  ◼ For simple datasets, comparable to other classification techniques
◼ Interpretability
  ◼ Model is interpretable for small trees
  ◼ Single predictions are interpretable
◼ Incrementality
  ◼ Not incremental
◼ Efficiency
  ◼ Fast model building
  ◼ Very fast classification
◼ Scalability
  ◼ Scalable both in training set size and attribute number
◼ Robustness
  ◼ Difficult management of missing data
Random Forest
Random Forest
◼ Ensemble learning technique
◼ multiple base models are combined
◼ to improve accuracy and stability
◼ to avoid overfitting
Bibliography: Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, Springer, 2009
Random Forest
(figure: random subsets D1, ..., Dj, ..., DB are drawn from the original training dataset; a decision tree is trained on each subset and their predictions are combined into the final class)
Bootstrap aggregation
◼ Given a training set D of n instances, it selects B
times a random sample with replacement from D
and trains trees on these dataset samples
◼ For b = 1, ..., B
  ◼ sample with replacement n' training examples from D, with n' ≤ n
  ◼ train a decision tree on the sample
Feature bagging
◼ Selects, for each candidate split in the learning
process, a random subset of the features
  ◼ if p is the number of features, √p features are typically selected
◼ Trees are decorrelated
◼ Feature subsets are sampled randomly, hence different
features can be selected as best attributes for the split
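A hedged sketch of the two ingredients (bootstrap samples and √p feature subsets) as exposed by scikit-learn's RandomForestClassifier (parameter values are illustrative):

```python
# Random forest sketch: B bootstrapped trees, each split chosen among sqrt(p) random features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=25, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

forest = RandomForestClassifier(
    n_estimators=100,      # B trees, each trained on a bootstrap sample (bagging)
    max_features="sqrt",   # feature bagging: sqrt(p) candidate features per split
    bootstrap=True,
    random_state=1,
)
forest.fit(X_tr, y_tr)
print("test accuracy:", round(forest.score(X_te, y_te), 3))
```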
Random Forest – Algorithm Recap
(figure: recap of the algorithm, combining bootstrap aggregation and feature bagging to train the B trees of the forest)
Evaluation of random forests
◼ Accuracy
  ◼ Higher than decision trees
◼ Efficiency
  ◼ Fast model building
  ◼ Very fast classification
◼ Interpretability
  ◼ Model is not interpretable (it is an ensemble of many trees)
Rule-based classification
Rule-based classifier
◼ Classify records by using a collection of “if…then…”
rules
◼ Rule: (Condition) → y
◼ where
◼ Condition is a conjunction of simple predicates
◼ y is the class label
◼ LHS: rule antecedent or condition
◼ RHS: rule consequent
◼ Examples of classification rules
  ◼ (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
  ◼ (Taxable Income < 50K) ∧ (Refund=Yes) → Cheat=No
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
Rule-based classification
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
Characteristics of rules
◼ Mutually exclusive rules
◼ Two rule conditions can’t be true at the same
time
◼ Every record is covered by at most one rule
◼ Exhaustive rules
◼ Classifier rules account for every possible
combination of attribute values
◼ Each record is covered by at least one rule
From decision trees to rules
(figure: the decision tree on Refund, Marital Status and Taxable Income; one classification rule is generated for each path from the root to a leaf)
Classification Rules
(Refund=Yes) ==> No
Rules can be simplified
(figure: the decision tree — Refund, then Marital Status, then Taxable Income — together with the 10-record training data from the previous example)
E.g., the rule extracted from the Married branch, (Refund=No) ∧ (Marital Status=Married) → No, can be simplified to (Marital Status=Married) → No.
Effect of rule simplification
◼ Rules are no longer mutually exclusive
◼ A record may trigger more than one rule
◼ Solution?
◼ Ordered rule set
◼ Unordered rule set – use voting schemes
◼ Rules are no longer exhaustive
◼ A record may not trigger any rules
◼ Solution?
◼ Use a default class
Ordered rule set
◼ Rules are rank ordered according to their priority
◼ An ordered rule set is known as a decision list
◼ When a test record is presented to the classifier
◼ It is assigned to the class label of the highest ranked rule it has
triggered
◼ If none of the rules fired, it is assigned to the default class
Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
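A plain-Python sketch of a decision list built from rules R1–R5 above (the rule encoding and attribute names are illustrative): each test record is assigned the class of the highest-ranked rule it triggers, or the default class if no rule fires.

```python
# Decision-list sketch: the first matching rule wins, otherwise the default class is used.
rules = [  # (condition, class), in priority order R1..R5
    (lambda r: r["give_birth"] == "no" and r["can_fly"] == "yes", "Birds"),
    (lambda r: r["give_birth"] == "no" and r["live_in_water"] == "yes", "Fishes"),
    (lambda r: r["give_birth"] == "yes" and r["blood_type"] == "warm", "Mammals"),
    (lambda r: r["give_birth"] == "no" and r["can_fly"] == "no", "Reptiles"),
    (lambda r: r["live_in_water"] == "sometimes", "Amphibians"),
]

def classify(record, default="Unknown"):
    for condition, label in rules:
        if condition(record):
            return label          # highest-ranked triggered rule
    return default                # no rule fired

turtle = {"blood_type": "cold", "give_birth": "no", "can_fly": "no", "live_in_water": "sometimes"}
print(classify(turtle))           # R4 fires before R5: classified as Reptiles
```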
Building classification rules
◼ Direct Method
◼ Extract rules directly from data
◼ e.g.: RIPPER, CN2, Holte’s 1R
◼ Indirect Method
  ◼ Extract rules from other classification models (e.g. decision trees, neural networks)
  ◼ e.g.: C4.5rules
Evaluation of rule-based classifiers
◼ Accuracy
  ◼ Higher than decision trees
◼ Efficiency
  ◼ Fast model building
  ◼ Very fast classification
◼ Interpretability
  ◼ Model and predictions are interpretable
Associative classification
Associative classification
◼ The classification model is defined by
means of association rules
(Condition) → y
◼ rule body is an itemset
◼ Model generation
◼ Rule selection & sorting
◼ based on support, confidence and correlation
thresholds
◼ Rule pruning
    ◼ Database coverage: the training set is covered by selecting topmost rules according to the previous sort
Evaluation of associative classifiers
◼ Accuracy
  ◼ Higher than decision trees and rule-based classifiers
  ◼ correlation among attributes is considered
◼ Interpretability
  ◼ Model and prediction are interpretable
◼ Incrementality
  ◼ Not incremental
◼ Efficiency
  ◼ Rule generation may be slow
    ◼ It depends on the support threshold
  ◼ Very fast classification
◼ Scalability
  ◼ Scalable in training set size
  ◼ Reduced scalability in attribute number
    ◼ Rule generation may become unfeasible
◼ Robustness
  ◼ Unaffected by missing data
  ◼ Robust to outliers
K-Nearest Neighbor
Instance-Based Classifiers
(figure: a set of stored cases with their attributes and class labels, and an unseen case to be classified)
• Store the training records
• Use the stored records to predict the class label of unseen cases
Instance Based Classifiers
◼ Examples
◼ Rote-learner
◼ Memorizes entire training data and performs
classification only if attributes of record match one of
the training examples exactly
◼ Nearest neighbor
◼ Uses k “closest” points (nearest neighbors) for
performing classification
Nearest-Neighbor Classifiers
(figure: an unknown record among labeled training records)
Requires:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
Definition of Nearest Neighbor
(figure: the 1-, 2- and 3-nearest neighbors of a record x)
◼ The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Nearest Neighbor Classification
◼ Compute distance between two points
  ◼ Euclidean distance

    $d(p,q) = \sqrt{\sum_i (p_i - q_i)^2}$
Nearest Neighbor Classification
◼ Scaling issues
◼ Attribute domain should be normalized to prevent
distance measures from being dominated by one
of the attributes
◼ Example: height [1.5m to 2.0m] vs. income
[$10K to $1M]
◼ Problem with distance measures
◼ High dimensional data
◼ curse of dimensionality
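A hedged sketch addressing the scaling issue: normalize the attributes before computing Euclidean distances, e.g. with a scikit-learn pipeline (dataset and parameter values are illustrative):

```python
# k-NN sketch: standardize attributes so no single attribute dominates the distance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=6, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=3)

knn = make_pipeline(
    StandardScaler(),                                   # normalize attribute domains
    KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
)
knn.fit(X_tr, y_tr)                                     # (almost) no model building: stores the data
print("test accuracy:", round(knn.score(X_te, y_te), 3))
```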
Evaluation of KNN
◼ Accuracy
  ◼ Comparable to other classification techniques for simple datasets
◼ Interpretability
  ◼ Model is not interpretable
  ◼ Single predictions can be "described" by the neighbors
◼ Incrementality
  ◼ Incremental
  ◼ Training set must be available
◼ Efficiency
  ◼ (Almost) no model building
  ◼ Slower classification, requires computing distances
◼ Scalability
  ◼ Weakly scalable in training set size
  ◼ Curse of dimensionality for increasing attribute number
◼ Robustness
  ◼ Depends on distance computation
Bayesian Classification
Bayes theorem
◼ Let C and X be random variables
P(C,X) = P(C|X) P(X)
P(C,X) = P(X|C) P(C)
◼ Hence
P(C|X) P(X) = P(X|C) P(C)
◼ and also
P(C|X) = P(X|C) P(C) / P(X)
Bayesian classification
◼ Let the class attribute and all data attributes be random
variables
◼ C = any class label
◼ X = <x1,…,xk> record to be classified
◼ Bayesian classification
◼ compute P(C|X) for all classes
◼ probability that record X belongs to C
◼ assign X to the class with maximal P(C|X)
◼ Applying Bayes theorem
P(C|X) = P(X|C)·P(C) / P(X)
◼ P(X) constant for all C, disregarded for maximum computation
◼ P(C) a priori probability of C
P(C) = Nc/N
Bayesian classification
◼ How to estimate P(X|C), i.e. P(x1,…,xk|C)?
◼ Naïve hypothesis
P(x1,…,xk|C) = P(x1|C) P(x2|C) … P(xk|C)
◼ statistical independence of attributes x1,…,xk
◼ not always true
◼ model quality may be affected
◼ Computing P(xk|C)
◼ for discrete attributes
P(xk|C) = |xkC|/ Nc
◼ where |xkC| is number of instances having value xk for attribute k
and belonging to class C
◼ for continuous attributes, use probability distribution
◼ Bayesian networks
◼ allow specifying a subset of dependencies among attributes
Bayesian classification: Example
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
From: Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann 2006
Bayesian classification: Example
outlook
P(sunny|p) = 2/9 P(sunny|n) = 3/5 P(p) = 9/14
P(overcast|p) = 4/9 P(overcast|n) = 0 P(n) = 5/14
P(rain|p) = 3/9 P(rain|n) = 2/5
temperature
P(hot|p) = 2/9 P(hot|n) = 2/5
P(mild|p) = 4/9 P(mild|n) = 2/5
P(cool|p) = 3/9 P(cool|n) = 1/5
humidity
P(high|p) = 3/9 P(high|n) = 4/5
P(normal|p) = 6/9 P(normal|n) = 2/5
windy
P(true|p) = 3/9 P(true|n) = 3/5
P(false|p) = 6/9 P(false|n) = 2/5
Bayesian classification: Example
◼ Data to be labeled
X = <rain, hot, high, false>
◼ For class p
P(X|p)·P(p) =
= P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
= 3/9·2/9·3/9·6/9·9/14 = 0.010582
◼ For class n
P(X|n)·P(n) =
= P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
= 2/5·2/5·4/5·2/5·5/14 = 0.018286
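The two scores above can be checked with a few lines of Python (a sketch; the dictionaries simply transcribe the conditional probabilities estimated on the previous slide):

```python
# Naive Bayes sketch: P(X|C)*P(C) for X = <rain, hot, high, false>.
from fractions import Fraction as F

priors = {"p": F(9, 14), "n": F(5, 14)}
cond = {
    "p": {"rain": F(3, 9), "hot": F(2, 9), "high": F(3, 9), "false": F(6, 9)},
    "n": {"rain": F(2, 5), "hot": F(2, 5), "high": F(4, 5), "false": F(2, 5)},
}
x = ["rain", "hot", "high", "false"]

for c in ("p", "n"):
    score = priors[c]
    for value in x:                 # naive hypothesis: multiply per-attribute probabilities
        score *= cond[c][value]
    print(c, float(score))          # ~0.010582 for p, ~0.018286 for n -> assign class n
```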
Evaluation of Naïve Bayes Classifiers
◼ Accuracy
  ◼ Similar or lower than decision trees
  ◼ Naïve hypothesis simplifies the model
◼ Efficiency
  ◼ Fast model building
  ◼ Very fast classification
◼ Scalability
  ◼ Scalable both in training set size and attribute number
Support Vector Machines
Support Vector Machines
(figures: a linearly separable two-class dataset; several hyperplanes, e.g. B1 and B2, separate the two classes; each hyperplane Bi has margin boundaries bi1 and bi2)
◼ Find a linear hyperplane (decision boundary) that separates the data
◼ Among the possible separating hyperplanes, choose the one that maximizes the margin
Nonlinear Support Vector Machines
◼ What if decision boundary is not linear?
Nonlinear Support Vector Machines
◼ Transform data into higher dimensional space
Evaluation of Support Vector Machines
◼ Accuracy
  ◼ Among best performers
◼ Interpretability
  ◼ Model and prediction are not interpretable
  ◼ Black box model
◼ Incrementality
  ◼ Not incremental
◼ Efficiency
  ◼ Model building requires significant parameter tuning
  ◼ Very fast classification
◼ Scalability
  ◼ Medium scalable both in training set size and attribute number
◼ Robustness
  ◼ Robust to noise and outliers
Artificial Neural Networks
Artificial Neural Networks
◼ Inspired by the structure of the human brain
◼ Neurons as elaboration units
◼ Synapses as connection network
Artificial Neural Networks
◼ Different tasks, different architectures
  ◼ image understanding: convolutional NN (CNN)
  ◼ time series analysis: recurrent NN (RNN)
Feed Forward Neural Network
Structure of a neuron
(figure: the inputs x0, x1, ..., xn are weighted by w0, w1, ..., wn, summed together with the offset -μk, and passed through the activation function f to produce the output y)
Activation Functions
◼ Activation
◼ simulates biological activation to input stimuli
◼ provides non-linearity to the computation
◼ may help to saturate neuron outputs in fixed ranges
Activation Functions
◼ Sigmoid, tanh
◼ saturate input value in a fixed range
◼ non-linear over the whole input range
◼ typically used by FFNNs for both hidden and output layers
◼ E.g. sigmoid in output layers allows generating values between 0 and 1
(useful when output must be interpreted as likelihood)
Activation Functions
◼ Binary Step
◼ outputs 1 when the input is non-negative, 0 otherwise
◼ useful for binary outputs
◼ issues: not appropriate for gradient descent
◼ derivative not defined in x=0
◼ derivative equal to 0 in every other position
Activation Functions
◼ Softmax
  ◼ unlike other activation functions, it is applied only to the output layer
  ◼ works by considering all the neurons in the layer: the outputs are normalized so that they sum to 1 and can be interpreted as class likelihoods
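A small numerical sketch of the activation functions mentioned above (NumPy assumed; these are the standard textbook definitions, not code from the slides):

```python
# Common activation functions (sketch with standard definitions).
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))           # saturates in (0, 1)

def tanh(x):
    return np.tanh(x)                      # saturates in (-1, 1)

def step(x):
    return (x >= 0).astype(float)          # binary step: 0/1 output, unusable gradient

def relu(x):
    return np.maximum(0, x)                # used in CNNs: does not saturate for x > 0

def softmax(z):
    e = np.exp(z - np.max(z))              # applied to the whole output layer
    return e / e.sum()                     # outputs sum to 1 (class likelihoods)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), softmax(z))
```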
Building a FFNN
◼ For each node, definition of
  ◼ the set of weights
  ◼ the offset value
  that provide the highest accuracy on the training data
◼ Iterative approach on training data
instances
Building a FFNN
◼ Base algorithm
◼ Initially assign random values to weights and offsets
◼ Process instances in the training set one at a time
◼ For each neuron, compute the result when applying weights,
offset and activation function for the instance
◼ Forward propagation until the output is computed
◼ Compare the computed output with the expected output, and
evaluate error
◼ Backpropagation of the error, by updating weights and offset for
each neuron
◼ The process ends when
◼ % of accuracy above a given threshold
◼ % of parameter variation (error) below a given threshold
◼ The maximum number of epochs is reached
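A hedged sketch of training a feed-forward network with scikit-learn's MLPClassifier, whose fit loop follows the same forward-propagation / backpropagation scheme (hyperparameter values and dataset are illustrative):

```python
# Feed-forward NN sketch: iterative training with backpropagation (scikit-learn MLP).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16),   # two hidden layers
                  activation="relu",
                  max_iter=300,                  # maximum number of epochs
                  tol=1e-4,                      # stop when improvement falls below this threshold
                  random_state=7),
)
mlp.fit(X_tr, y_tr)
print("test accuracy:", round(mlp.score(X_te, y_te), 3))
```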
Evaluation of Feed Forward NN
◼ Accuracy
  ◼ Among best performers
◼ Interpretability
  ◼ Model and prediction are not interpretable
  ◼ Black box model
◼ Incrementality
  ◼ Not incremental
◼ Efficiency
  ◼ Model building requires very complex parameter tuning
  ◼ It requires significant time
  ◼ Very fast classification
◼ Scalability
  ◼ Medium scalable both in training set size and attribute number
◼ Robustness
  ◼ Robust to noise and outliers
  ◼ Requires a large training set
  ◼ Otherwise unstable when tuning parameters
Convolutional Neural Networks
◼ Allow automatically extracting features from images and
performing classification
Convolutional Neural Networks
(figure: a CNN pipeline in which the convolutional layers perform feature extraction and the final fully connected layers perform the classification, with softmax activation on the output)
Convolutional Neural Networks
◼ Typical convolutional layer
◼ convolution stage: feature extraction by means of (hundreds to
thousands) sliding filters
◼ sliding filters activation: apply activation functions to input tensor
◼ pooling: tensor downsampling
(figure: convolution → activation → pooling)
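A hedged sketch of one convolution → activation → pooling stage and the resulting tensor shapes (it assumes PyTorch; the layer sizes are illustrative):

```python
# One convolutional stage (sketch): convolution, ReLU activation, max pooling.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)          # input tensor: batch of 1 RGB image, shape [N, d, h, w]

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)  # 16 sliding filters
act = nn.ReLU()
pool = nn.MaxPool2d(kernel_size=2)     # downsample h and w by 2

out = pool(act(conv(x)))
print(conv(x).shape)                   # torch.Size([1, 16, 32, 32]) -> one depth layer per filter
print(out.shape)                       # torch.Size([1, 16, 16, 16]) after pooling
```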
Convolutional Neural Networks
◼ Tensors
◼ data flowing through CNN layers is represented in the form of tensors
◼ Tensor = N-dimensional vector
◼ Rank = number of dimensions
◼ scalar: rank 0
◼ 1-D vector: rank 1
◼ 2-D matrix: rank 2
◼ Shape = number of elements for each dimension
◼ e.g. a vector of length 5 has shape [5]
◼ e.g. a matrix w x h, w=5, h=3 has shape [h, w] = [3, 5]
Convolutional Neural Networks
◼ Images
◼ rank-3 tensors with shape [d,h,w]
◼ where h=height, w=width, d=image depth (1 for grayscale, 3 for RGB
colors)
Convolutional Neural Networks
◼ Convolution
◼ processes data in form of tensors (multi-dimensional matrices)
◼ input: input image or intermediate features (tensor)
◼ output: a tensor with the extracted features
Convolutional Neural Networks
◼ Convolution
◼ a sliding filter produces the values of the output tensor
◼ sliding filters contain the trainable weights of the neural network
◼ each convolutional layer contains many (hundreds) filters
(figure: a sliding filter, with padding, moves over the input tensor to produce the output tensor)
Convolutional Neural Networks
◼ Convolution
◼ images are transformed into features by convolutional filters
◼ after convolving a tensor [d,h,w] with N filters we obtain
◼ a rank-3 tensor with shape [N,h,w]
◼ hence, each filter generates a layer in the depth of the output tensor
Convolutional Neural Networks
◼ Activation
◼ simulates biological activation to input stimuli
◼ provides non-linearity to the computation
◼ ReLU is typically used for CNNs
◼ faster training (no vanishing gradients)
◼ does not saturate
◼ faster computation of derivatives for backpropagation
(figure: the ReLU function, ReLU(x) = max(0, x))
Convolutional Neural Networks
◼ Pooling
◼ performs tensor downsampling
◼ sliding filter which replaces tensor values with a summary statistic of the
nearby outputs
◼ maxpool is the most common: computes the maximum value as statistic
(figure: a sliding max-pooling filter producing the downsampled output tensor)
Convolutional Neural Networks
◼ Convolutional layers training
◼ during training each sliding filter learns to recognize a particular
pattern in the input tensor
◼ filters in shallow layers recognize textures and edges
◼ filters in deeper layers can recognize objects and parts (e.g. eye,
ear or even faces)
(figure: patterns recognized by shallow filters vs. deeper filters)
Convolutional Neural Networks
◼ Semantic segmentation CNNs
◼ allow assigning a class to each pixel of the input image
◼ composed of 2 parts
◼ encoder network: convolutional layers to extract abstract features
◼ decoder network: deconvolutional layers to obtain the output image from
the extracted features
Recurrent Neural Networks
◼ Allow processing sequential data x(t)
◼ Differently from normal FFNN they are able to keep a state which
evolves during time
◼ Applications
◼ machine translation
◼ time series prediction
◼ speech recognition
◼ part of speech (POS) tagging
Recurrent Neural Networks
◼ RNN execution during time
(figure: the recurrent network unrolled over consecutive time steps, with the state passed from each step to the next)
Recurrent Neural Networks
◼ A RNN receives as input a vector x(t) and the state at previous time
step s(t-1)
◼ A RNN typically contains many neurons organized in different layers
(figure: internal structure — the input x(t) enters with weights w, the previous state s(t-1) re-enters through the state retroaction with weights w', and the network produces the output y(t))
Recurrent Neural Networks
◼ Training is performed with Backpropagation Through Time
◼ Given a training pair: input sequence x(t) and expected output y(t)
◼ error is propagated through time
◼ weights are updated to minimize the error across all the time steps
Recurrent Neural Networks
◼ Issues
  ◼ vanishing gradient: the error gradient decreases rapidly over time and the weights are not properly updated
    ◼ this makes it harder to train RNNs with long-term memories
(figure: an LSTM stage; LSTM cells are designed to mitigate the vanishing gradient problem)
Autoencoders
◼ Autoencoders allow compressing input data by means of compact
representations and from them reconstruct the initial input
◼ for feature extraction: the compressed representation can be used as significant
set of features representing input data
◼ for image (or signal) denoising: the image reconstructed from the abstract
representation is denoised with respect to the original one
(figure: the input is encoded into compressed data and then decoded to reconstruct it)
Word Embeddings (Word2Vec)
(figures: each input word, e.g. "man" or "king", is mapped to an embedding vector, e.g. e(man) = [e1, e2, e3], e(king) = [e1', e2', e3'])
Model evaluation
Model evaluation
◼ Methods for performance evaluation
◼ Partitioning techniques for training and test sets
◼ Metrics for performance evaluation
◼ Accuracy, other measures
◼ Techniques for model comparison
◼ ROC curve
Methods for performance evaluation
◼ Objective
◼ reliable estimate of performance
◼ Performance of a model may depend on
other factors besides the learning algorithm
◼ Class distribution
◼ Cost of misclassification
◼ Size of training and test sets
Learning curve
◼ The learning curve shows how accuracy changes with varying training sample size
◼ Requires a sampling schedule for creating the learning curve
  ◼ Arithmetic sampling (Langley et al.)
  ◼ Geometric sampling (Provost et al.)
◼ Effect of small sample size
  ◼ Bias in the estimate
  ◼ Variance of the estimate
Methods of estimation
◼ Partitioning labeled data for training,
validation and test
◼ Several partitioning techniques
◼ holdout
◼ cross validation
◼ Stratified sampling to generate partitions
◼ without replacement
◼ Bootstrap
◼ Sampling with replacement
Holdout
◼ Fixed partitioning
◼ Typically, may reserve 80% for training, 20%
for test
◼ Other proportions may be appropriate,
depending on the dataset size
◼ Appropriate for large datasets
◼ may be repeated several times
◼ repeated holdout
Cross validation
◼ Cross validation
◼ partition data into k disjoint subsets (i.e., folds)
◼ k-fold: train on k-1 partitions, test on the
remaining one
◼ repeat for all folds
◼ reliable accuracy estimation, not appropriate for
very large datasets
◼ Leave-one-out
◼ cross validation for k=n
◼ only appropriate for very small datasets
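A hedged sketch of k-fold cross validation with scikit-learn (k = 5; the estimator and dataset are illustrative):

```python
# k-fold cross validation sketch: train on k-1 folds, test on the remaining fold, repeat.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=5)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)   # stratified partitioning
scores = cross_val_score(DecisionTreeClassifier(random_state=5), X, y, cv=cv)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```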
Model performance estimation
◼ Model training step
◼ Building a new model
◼ Model validation step
◼ Hyperparameter tuning
◼ Algorithm selection
◼ Model test step
◼ Estimation of model performance
Model performance estimation
◼ Typical dataset size
◼ Training set 60% of labeled data
◼ Validation set 20% of labeled data
◼ Test set 20% of labeled data
◼ Splitting labeled data
◼ Use hold-out to split in
◼ training+validation
◼ test
◼ Use cross validation to split in
◼ training
◼ validation
Metrics for model evaluation
◼ Evaluate the predictive accuracy of a model
◼ Confusion matrix
◼ binary classifier
Accuracy
◼ Most widely-used metric for model
evaluation
  Accuracy = (number of correctly classified objects) / (number of classified objects)
Accuracy
◼ For a binary classifier

                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL    Class=Yes   a (TP)       b (FN)
CLASS     Class=No    c (FP)       d (TN)

  $Accuracy = \frac{a+d}{a+b+c+d} = \frac{TP+TN}{TP+TN+FP+FN}$
Limitations of accuracy
◼ Consider a binary problem
◼ Cardinality of Class 0 = 9900
◼ Cardinality of Class 1 = 100
◼ Model: () → class 0
◼ Model predicts everything to be class 0
◼ accuracy is 9900/10000 = 99.0 %
◼ Accuracy is misleading because the model
does not detect any class 1 object
Limitations of accuracy
◼ Classes may have different importance
◼ Misclassification of objects of a given class is
more important
◼ e.g., ill patients erroneously assigned to the
healthy patients class
◼ Accuracy is not appropriate for
◼ unbalanced class label distribution
◼ different class relevance
Class specific measures
◼ Evaluate separately for each class C
  ◼ precision p and recall r of class C
◼ Maximize

  $F\text{-}measure\ (F) = \frac{2rp}{r+p}$
Class specific measures
◼ For a binary classification problem, computed on the confusion matrix for the positive class

  $Precision\ (p) = \frac{a}{a+c}$

  $Recall\ (r) = \frac{a}{a+b}$

  $F\text{-}measure\ (F) = \frac{2rp}{r+p} = \frac{2a}{2a+b+c}$
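These measures can be computed directly from the confusion matrix counts, or with scikit-learn (a sketch; the label vectors are illustrative):

```python
# Precision, recall and F-measure for the positive class (sketch).
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # d, c, b, a in the slides' notation
precision = tp / (tp + fp)                                   # a / (a + c)
recall = tp / (tp + fn)                                      # a / (a + b)
f_measure = 2 * recall * precision / (recall + precision)    # 2rp / (r + p)
print(precision, recall, round(f_measure, 3))

# The same values via scikit-learn
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, pos_label=1, average="binary")
print(p, r, round(f, 3))
```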
ROC (Receiver Operating Characteristic)
◼ Developed in 1950s for signal detection
theory to analyze noisy signals
◼ characterizes the trade-off between positive hits
and false alarms
◼ ROC curve plots
◼ TPR, True Positive Rate (on the y-axis)
TPR = TP/(TP+FN)
against
  ◼ FPR, False Positive Rate (on the x-axis)
    FPR = FP/(FP+TN)
ROC curve
Each point on the curve is a pair (FPR, TPR):
◼ (0,0): declare everything to be negative class
◼ (1,1): declare everything to be positive class
◼ (0,1): ideal
◼ Diagonal line
  ◼ Random guessing
◼ Below the diagonal line
  ◼ prediction is opposite of the true class
How to build a ROC curve
◼ Use a classifier that produces a posterior probability P(+|A) for each test instance A
◼ Sort the instances according to P(+|A) in decreasing order
◼ Apply a threshold at each unique value of P(+|A)
◼ Count the number of TP, FP, TN, FN at each threshold
  ◼ TP rate TPR = TP/(TP+FN)
  ◼ FP rate FPR = FP/(FP+TN)

Instance   P(+|A)   True Class
1          0.95     +
2          0.93     +
3          0.87     -
4          0.85     -
5          0.85     -
6          0.85     +
7          0.76     -
8          0.53     +
9          0.43     -
10         0.25     +
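The same procedure, applied to the ten scored instances above with scikit-learn (a sketch):

```python
# ROC curve sketch on the ten scored instances of the example.
from sklearn.metrics import roc_auc_score, roc_curve

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]            # + -> 1, - -> 0

fpr, tpr, thresholds = roc_curve(labels, scores)   # thresholds applied at unique score values
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(labels, scores))
```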
How to build a ROC curve
Counts obtained by applying a threshold at each unique value of P(+|A):

Class       +     -     +     -     -     -     +     -     +     +
Threshold   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP          5     4     4     3     3     3     3     2     2     1     0
FP          5     5     4     4     3     2     1     1     0     0     0
TN          0     0     1     1     2     3     4     4     5     5     5
FN          0     1     1     2     2     2     2     3     3     4     5
TPR         1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR         1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

(figure: the resulting ROC curve)
Using ROC for Model Comparison
(figure: ROC curves of two models, M1 and M2)
◼ No model consistently outperforms the other
  ◼ M1 is better for small FPR
  ◼ M2 is better for large FPR
◼ Area under the ROC curve (AUC)
  ◼ Ideal: area = 1.0
  ◼ Random guess: area = 0.5