
Classification fundamentals

DBMG - Data Base and Data Mining Group of Politecnico di Torino

Elena Baralis
Politecnico di Torino
Classification
◼ Objectives
◼ prediction of a class label
◼ definition of an interpretable model of a given
phenomenon

(Diagram: training data → model; the model assigns class labels to unclassified data.)

DB
MG
2
Classification
◼ Applications
◼ detection of customers' propensity to leave a company
(churn or attrition)
◼ fraud detection
◼ classification of different pathology types
◼ …


DB
MG
5
Classification: definition
◼ Given
◼ a collection of class labels
◼ a collection of data objects labelled with a
class label
◼ Find a descriptive profile of each class,
which will allow the assignment of
unlabeled objects to the appropriate class

DB
MG
6
Definitions
◼ Training set
◼ Collection of labeled data objects used to learn
the classification model
◼ Test set
◼ Collection of labeled data objects used to
validate the classification model

DB
MG
7
Classification techniques
◼ Decision trees
◼ Classification rules
◼ Association rules
◼ Neural Networks
◼ Naïve Bayes and Bayesian Networks
◼ k-Nearest Neighbours (k-NN)
◼ Support Vector Machines (SVM)
◼ …
DB
MG
8
Evaluation of classification techniques

◼ Accuracy
  ◼ quality of the prediction
◼ Interpretability
  ◼ model interpretability
  ◼ model compactness
◼ Incrementality
  ◼ model update in presence of newly labelled records
◼ Efficiency
  ◼ model building time
  ◼ classification time
◼ Scalability
  ◼ training set size
  ◼ attribute number
◼ Robustness
  ◼ noise, missing data

DB
MG
9
Decision trees
DBMG - Data Base and Data Mining Group of Politecnico di Torino

Elena Baralis
Politecnico di Torino
Example of decision tree

Training Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Single, Divorced → TaxInc?
         │                     ├─ < 80K → NO
         │                     └─ > 80K → YES
         └─ Married → NO

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
11
Another example of decision tree

Training Data: the same Refund / Marital Status / Taxable Income / Cheat table as before.

Model: Decision Tree

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
                      ├─ Yes → NO
                      └─ No  → TaxInc?
                               ├─ < 80K → NO
                               └─ > 80K → YES

There could be more than one tree that fits the same data!

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
12
Apply Model to Test Data

Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch matching the test record at each node:
◼ Refund = No → follow the "No" branch to MarSt
◼ MarSt = Married → follow the "Married" branch, reaching a leaf labelled NO

Assign Cheat to "No"

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
18
Decision tree induction
◼ Many algorithms to build a decision tree
◼ Hunt’s Algorithm (one of the earliest)
◼ CART
◼ ID3, C4.5, C5.0
◼ SLIQ, SPRINT

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
19
General structure of Hunt’s algorithm
Let Dt be the set of training records that reach a node t
(example training data: the Refund / Marital Status / Taxable Income / Cheat table).

Basic steps
◼ If Dt contains records that belong to more than one class
  ◼ select the "best" attribute A on which to split Dt and label node t as A
  ◼ split Dt into smaller subsets and recursively apply the procedure to each subset
◼ If Dt contains records that all belong to the same class yt
  ◼ then t is a leaf node labeled as yt
◼ If Dt is an empty set
  ◼ then t is a leaf node labeled as the default (majority) class, yd

(A sketch of this recursion is given after this slide.)

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
20
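A minimal Python sketch of this recursion, not the exact procedure of any specific tool: the dataset is a list of dictionaries with a "class" key, and the impurity-based attribute selection inside choose_best_attribute anticipates the Gini measure introduced in the following slides; all names are illustrative assumptions.

```python
from collections import Counter

def majority_class(records, default=None):
    # Default (majority) class of a set of labelled records
    if not records:
        return default
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def gini(records):
    # Node impurity (anticipates the GINI measure presented later)
    n = len(records)
    counts = Counter(r["class"] for r in records)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def choose_best_attribute(records, attributes):
    # "Best" attribute = lowest weighted impurity of the resulting split
    def split_impurity(a):
        groups = Counter(r[a] for r in records)
        return sum(cnt / len(records) * gini([r for r in records if r[a] == v])
                   for v, cnt in groups.items())
    return min(attributes, key=split_impurity)

def build_tree(records, attributes, default_class=None):
    if not records:                                   # Dt is empty -> default class leaf
        return {"leaf": default_class}
    if len({r["class"] for r in records}) == 1 or not attributes:
        return {"leaf": majority_class(records)}      # pure node -> leaf
    A = choose_best_attribute(records, attributes)    # label node t as A
    node = {"attribute": A, "children": {}}
    for value in {r[A] for r in records}:             # recurse on each subset of Dt
        subset = [r for r in records if r[A] == value]
        node["children"][value] = build_tree(
            subset, [a for a in attributes if a != A], majority_class(records))
    return node

data = [{"Refund": "Yes", "MarSt": "Single", "class": "No"},
        {"Refund": "No", "MarSt": "Married", "class": "No"},
        {"Refund": "No", "MarSt": "Single", "class": "Yes"}]
print(build_tree(data, ["Refund", "MarSt"]))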
Hunt’s algorithm
Training data: the Refund / Marital Status / Taxable Income / Cheat table.

Step 1: a single node predicting the majority class
  Don't Cheat

Step 2: split on Refund
  Refund = Yes → Don't Cheat
  Refund = No  → Don't Cheat

Step 3: split the Refund = No subset on Marital Status
  Refund = Yes → Don't Cheat
  Refund = No, Marital Status = Married          → Don't Cheat
  Refund = No, Marital Status = Single, Divorced → to be refined

Step 4: split the remaining subset on Taxable Income
  Refund = No, Marital Status = Single, Divorced, Taxable Income < 80K  → Don't Cheat
  Refund = No, Marital Status = Single, Divorced, Taxable Income >= 80K → Cheat

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
21
Decision tree induction
◼ Adopts a greedy strategy
◼ “Best” attribute for the split is selected locally at
each step
◼ not a global optimum
◼ Issues
◼ Structure of test condition
◼ Binary split versus multiway split
◼ Selection of the best attribute for the split
◼ Stopping condition for the algorithm

DB
MG
22
Structure of test condition
◼ Depends on attribute type
◼ nominal
◼ ordinal
◼ continuous
◼ Depends on number of outgoing edges
◼ 2-way split
◼ multi-way split

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
23
Splitting on nominal attributes

◼ Multi-way split
  ◼ use as many partitions as distinct values
    e.g., CarType: Family / Sports / Luxury
◼ Binary split
  ◼ divides values into two subsets
  ◼ need to find the optimal partitioning
    e.g., CarType: {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
24
Splitting on ordinal attributes
◼ Multi-way split
  ◼ use as many partitions as distinct values
    e.g., Size: Small / Medium / Large
◼ Binary split
  ◼ divides values into two subsets
  ◼ need to find the optimal partitioning
    e.g., Size: {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}
  ◼ What about the split {Small, Large} vs. {Medium}? (it does not preserve the order of the attribute values)

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
25
Splitting on continuous attributes
◼ Different techniques
◼ Discretization to form an ordinal categorical
attribute
◼ Static – discretize once at the beginning
◼ Dynamic – discretize during tree induction
Ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), or
clustering
◼ Binary decision (A < v) or (A ≥ v)
◼ consider all possible splits and find the best cut
◼ more computationally intensive
DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
26
Splitting on continuous attributes

(i) Binary split: Taxable Income > 80K? (Yes / No)

(ii) Multi-way split: Taxable Income? partitioned into
     < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
27
Selection of the best attribute
Before splitting: 10 records of class 0 (C0), 10 records of class 1 (C1)

Candidate test conditions:
◼ Own Car? (Yes / No): C0: 6, C1: 4  |  C0: 4, C1: 6
◼ Car Type? (Family / Sports / Luxury): C0: 1, C1: 3  |  C0: 8, C1: 0  |  C0: 1, C1: 7
◼ Student ID? (c1 ... c20): each child node contains a single record (C0: 1, C1: 0 or C0: 0, C1: 1)

Which attribute (test condition) is the best?

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
28
Selection of the best attribute
◼ Attributes with homogeneous class
distribution are preferred
◼ Need a measure of node impurity

C0: 5, C1: 5 → non-homogeneous, high degree of impurity
C0: 9, C1: 1 → homogeneous, low degree of impurity

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
29
Measures of node impurity
◼ Many different measures available
◼ Gini index
◼ Entropy
◼ Misclassification error
◼ Different algorithms rely on different
measures

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
30
How to find the best attribute
Before splitting: node with class counts C0: N00, C1: N01 and impurity M0

Candidate binary splits A? and B? (Yes / No):
◼ A? produces nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21), with impurities M1 and M2;
  the weighted impurity of the split is M12
◼ B? produces nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41), with impurities M3 and M4;
  the weighted impurity of the split is M34

Compare Gain = M0 – M12 vs. M0 – M34 and choose the split with the higher gain
DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
31
GINI impurity measure
◼ Gini Index for a given node t:

  GINI(t) = 1 − Σ_j [p(j|t)]²

  p(j|t) is the relative frequency of class j at node t

◼ Maximum (1 − 1/nc) when records are equally distributed among all classes, implying a higher impurity degree
◼ Minimum (0.0) when all records belong to one class, implying a lower impurity degree

  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
32
Examples for computing GINI
GINI(t) = 1 − Σ_j [p(j|t)]²

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Gini = 1 − (1/6)² − (5/6)² = 0.278

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Gini = 1 − (2/6)² − (4/6)² = 0.444

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
33
Splitting based on GINI
◼ Used in CART, SLIQ, SPRINT
◼ When a node p is split into k partitions (children), the
quality of the split is computed as
GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where
  n_i = number of records at child i
  n = number of records at node p

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
34
Computing GINI index: Boolean attribute
◼ Splits into two partitions
◼ larger and purer partitions are sought for
Parent node: C1: 6, C2: 6, Gini = 0.500

Split on B? into node N1 (Yes) and node N2 (No):

  N1: C1 = 5, C2 = 2    Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
  N2: C1 = 1, C2 = 4    Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320

  Gini(split on B) = 7/12 · 0.408 + 5/12 · 0.320 = 0.371

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
35
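A short Python sketch of these computations (an illustrative helper, not a library API); it reproduces the node and split values above.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; children is a list of class-count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini([6, 6]), 3))                  # parent: 0.5
print(round(gini([5, 2]), 3))                  # N1: 0.408
print(round(gini([1, 4]), 3))                  # N2: 0.32
print(round(gini_split([[5, 2], [1, 4]]), 3))  # split on B: 0.371
```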
Computing GINI index: Categorical attribute
◼ For each distinct value, gather counts for each class in the
dataset
◼ Use the count matrix to make decisions

Multi-way split:
  CarType   Family  Sports  Luxury
  C1        1       2       1
  C2        4       1       1
  Gini = 0.393

Two-way split (find the best partition of values):
  CarType   {Sports, Luxury}  {Family}
  C1        3                 1
  C2        2                 4
  Gini = 0.400

  CarType   {Family, Luxury}  {Sports}
  C1        2                 2
  C2        1                 5
  Gini = 0.419

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
36
Computing GINI index: Continuous attribute
◼ Binary decision on one splitting value
◼ Number of possible splitting values = number of distinct values
◼ Each splitting value v has a count matrix
  ◼ class counts in the two partitions
    ◼ A < v
    ◼ A ≥ v

Example: split Taxable Income > 80K? (Yes / No) on the Refund / Marital Status / Taxable Income / Cheat training data.
DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
37
Computing GINI index: Continuous attribute
◼ For each attribute
◼ Sort the attribute on values
◼ Linearly scan these values, each time updating the count matrix
and computing gini index
◼ Choose the split position that has the least gini index

Cheat (sorted by income)   No    No    No    Yes   Yes   Yes   No    No    No    No
Taxable Income (sorted)    60    70    75    85    90    95    100   120   125   220
Candidate split positions  55    65    72    80    87    92    97    110   122   172   230
Gini                       0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

For each candidate split v, the count matrix stores the class counts (Yes / No) of the two partitions
Taxable Income ≤ v and > v; the best split position is 97, with Gini = 0.300.

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
38
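A compact Python sketch of this scan (illustrative, not an optimized implementation). For simplicity it evaluates the midpoints between consecutive distinct values instead of updating the count matrix incrementally, so with the ten income values above it reports the best cut at 97.5 (the slide rounds the candidate position to 97) with Gini = 0.300.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Scan candidate thresholds (midpoints between sorted distinct values)
    and return the one with the lowest weighted Gini."""
    pairs = sorted(zip(values, labels))
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for v in candidates:
        left = [l for x, l in pairs if x <= v]
        right = [l for x, l in pairs if x > v]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or w < best[1]:
            best = (v, w)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # -> (97.5, 0.3)
```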
Entropy impurity measure (INFO)
◼ Entropy at a given node t
Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

p( j | t) is the relative frequency of class j at node t


◼ Maximum (log nc) when records are equally
distributed among all classes, implying higher
impurity degree
◼ Minimum (0.0) when all records belong to one
class, implying lower impurity degree
◼ Entropy based computations are similar to
GINI index computations
DB
M G 39
From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
Examples for computing entropy
Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Entropy = − 0 log₂ 0 − 1 log₂ 1 = − 0 − 0 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Entropy = − (1/6) log₂ (1/6) − (5/6) log₂ (5/6) = 0.65

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Entropy = − (2/6) log₂ (2/6) − (4/6) log₂ (4/6) = 0.92

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
40
Splitting Based on INFO
◼ Information Gain

GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) · Entropy(i)

where the parent node p is split into k partitions and n_i is the number of records in partition i
◼ Measures reduction in entropy achieved because of the
split. Choose the split that achieves most reduction
(maximizes GAIN)
◼ Used in ID3 and C4.5
◼ Disadvantage: Tends to prefer splits yielding a
large number of partitions, each small but pure
DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
41
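A small Python sketch of entropy and information gain (illustrative helpers operating on class-count lists, not a library API); it reproduces the entropy values computed above and applies the gain to the Boolean split B? used in the Gini example.

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy reduction achieved by splitting the parent into the given children."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

print(round(entropy([1, 5]), 2))   # 0.65
print(round(entropy([2, 4]), 2))   # 0.92
# gain of the split B?: parent [6, 6] -> children [5, 2] and [1, 4]
print(round(information_gain([6, 6], [[5, 2], [1, 4]]), 3))   # ≈ 0.196
```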
Splitting Based on INFO
◼ Gain Ratio
GainRATIO_split = GAIN_split / SplitINFO

SplitINFO = − Σ_{i=1..k} (n_i / n) · log(n_i / n)

where the parent node p is split into k partitions and n_i is the number of records in partition i

◼ Adjusts Information Gain by the entropy of the


partitioning (SplitINFO). Higher entropy partitioning
(large number of small partitions) is penalized
◼ Used in C4.5
◼ Designed to overcome the disadvantage of
Information Gain
DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
42
Comparison among splitting criteria
For a 2-class problem

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
45
Stopping Criteria for Tree Induction
◼ Stop expanding a node when all the records
belong to the same class

◼ Stop expanding a node when all the records


have similar attribute values

◼ Early termination
◼ Pre-pruning
◼ Post-pruning
DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
46
Underfitting and Overfitting
(Plot: training and test error vs. model complexity.)

Underfitting: when the model is too simple, both training and test errors are large
Overfitting: when the model is too complex, the training error keeps decreasing while the test error grows

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
47
Overfitting due to Noise

Decision boundary is distorted by noise point

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
48
How to address overfitting
◼ Pre-Pruning (Early Stopping Rule)
◼ Stop the algorithm before it becomes a fully-grown tree

◼ Typical stopping conditions for a node

◼ Stop if all instances belong to the same class


◼ Stop if all the attribute values are the same
◼ More restrictive conditions
◼ Stop if number of instances is less than some user-specified
threshold
◼ Stop if the class distribution of instances is independent of the
available features (e.g., using the χ² test)
◼ Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain)

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
49
How to address overfitting
◼ Post-pruning
◼ Grow decision tree to its entirety
◼ Trim the nodes of the decision tree in a bottom-
up fashion
◼ If generalization error improves after trimming,
replace sub-tree by a leaf node.
◼ Class label of leaf node is determined from
majority class of instances in the sub-tree

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
50
Data fragmentation
◼ Number of instances gets smaller as you
traverse down the tree

◼ Number of instances at the leaf nodes could


be too small to make any statistically
significant decision

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
51
Handling missing attribute values
◼ Missing values affect decision tree
construction in three different ways
◼ Affect how impurity measures are computed
◼ Affect how to distribute instances with missing
value to child nodes
◼ Affect how a test instance with missing value is
classified

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
52
Decision boundary
(Plot: two numeric attributes x and y in [0, 1]; the regions induced by the tree below have axis-parallel borders.)

x < 0.43?
├─ Yes → y < 0.47?
│        ├─ Yes → class 1: 4, class 2: 0
│        └─ No  → class 1: 0, class 2: 4
└─ No  → y < 0.33?
         ├─ Yes → class 1: 0, class 2: 3
         └─ No  → class 1: 4, class 2: 0
• Border line between two neighboring regions of different classes is known
as decision boundary
• Decision boundary is parallel to axes because test condition involves a
single attribute at-a-time

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
56
Oblique decision trees

(Plot: the oblique test condition x + y < 1 separates the two classes, Class = + and the other class.)

• Test condition may involve multiple attributes


• More expressive representation
• Finding optimal test condition is computationally expensive

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
57
Evaluation of decision trees

◼ Accuracy
  ◼ For simple datasets, comparable to other classification techniques
◼ Interpretability
  ◼ Model is interpretable for small trees
  ◼ Single predictions are interpretable
◼ Incrementality
  ◼ Not incremental
◼ Efficiency
  ◼ Fast model building
  ◼ Very fast classification
◼ Scalability
  ◼ Scalable both in training set size and attribute number
◼ Robustness
  ◼ Difficult management of missing data

DB
MG
59
Random Forest
DBMG - Data Base and Data Mining Group of Politecnico di Torino

Elena Baralis
Politecnico di Torino
Random Forest
◼ Ensemble learning technique
◼ multiple base models are combined
◼ to improve accuracy and stability
◼ to avoid overfitting

◼ Random forest = set of decision trees


◼ a number of decision trees are built at training
time
◼ the class is assigned by majority voting

DB
MG Bibliography: Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, Springer, 2009 61
Random Forest
(Diagram of the random forest pipeline)
◼ Original training dataset
◼ Random subsets D1, ..., Dj, ..., DB
◼ Multiple decision trees: for each subset, a tree is learned on a random set of features
◼ Aggregating classifiers: majority voting assigns the final class

DB
MG
62
Bootstrap aggregation
◼ Given a training set D of n instances, it selects B
times a random sample with replacement from D
and trains trees on these dataset samples

◼ For b = 1, ..., B
◼ Sample with replacement n’ training examples, n’≤n

◼ A dataset subset Db is generated

◼ Train a classification tree on Db

DB
MG
63
Feature bagging
◼ Selects, for each candidate split in the learning
process, a random subset of the features
◼ If p is the number of features, √p features are typically selected
◼ Trees are decorrelated
◼ Feature subsets are sampled randomly, hence different
features can be selected as best attributes for the split

DB
MG
64
Random Forest – Algorithm Recap

DB
MG
65
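A minimal sketch of the recap under the assumptions above (bootstrap sampling plus feature bagging with majority voting), using scikit-learn decision trees as base learners on numpy arrays; the number of trees and the random seed are illustrative. In practice one would use sklearn.ensemble.RandomForestClassifier directly; the sketch only makes the two sources of randomness explicit.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    def __init__(self, n_trees=100, random_state=0):
        self.n_trees = n_trees
        self.rng = np.random.default_rng(random_state)
        self.trees = []

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.n_trees):
            idx = self.rng.integers(0, n, size=n)               # bootstrap sample (with replacement)
            tree = DecisionTreeClassifier(max_features="sqrt")  # feature bagging at each split
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        votes = np.stack([t.predict(X) for t in self.trees])    # shape: (n_trees, n_samples)
        majority = []
        for col in votes.T:                                     # majority vote per sample
            values, counts = np.unique(col, return_counts=True)
            majority.append(values[np.argmax(counts)])
        return np.array(majority)
```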
Evaluation of random forests

◼ Accuracy
  ◼ Higher than decision trees
◼ Interpretability
  ◼ Model and prediction are not interpretable
  ◼ A prediction may be given by hundreds of trees
  ◼ Provide global feature importance
    ◼ an estimate of which features are important in the classification
◼ Incrementality
  ◼ Not incremental
◼ Efficiency
  ◼ Fast model building
  ◼ Very fast classification
◼ Scalability
  ◼ Scalable both in training set size and attribute number
◼ Robustness
  ◼ Robust to noise and outliers

DB
MG
67
Rule-based classification
DBMG - Data Base and Data Mining Group of Politecnico di Torino

Elena Baralis
Politecnico di Torino
Rule-based classifier
◼ Classify records by using a collection of “if…then…”
rules
◼ Rule: (Condition) → y
◼ where
◼ Condition is a conjunction of simple predicates
◼ y is the class label
◼ LHS: rule antecedent or condition
◼ RHS: rule consequent
◼ Examples of classification rules
◼ (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
◼ (Taxable Income < 50K) ∧ (Refund=Yes) → Cheat=No

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
69
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
70
Rule-based classification
◼ A rule r covers an instance x if the attributes
of the instance satisfy the condition of the
rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?

Rule R1 covers a hawk => Bird


Rule R3 covers the grizzly bear => Mammal

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
71
Rule-based classification
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers (only) rule R3, so it is classified as a mammal


A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
72
Characteristics of rules
◼ Mutually exclusive rules
◼ Two rule conditions can’t be true at the same
time
◼ Every record is covered by at most one rule

◼ Exhaustive rules
◼ Classifier rules account for every possible
combination of attribute values
◼ Each record is covered by at least one rule

DB
MG
73
From decision trees to rules

Decision tree:

Refund?
├─ Yes → NO
└─ No  → Marital Status?
         ├─ Single, Divorced → Taxable Income?
         │                     ├─ < 80K → NO
         │                     └─ > 80K → YES
         └─ Married → NO

Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Rules are mutually exclusive and exhaustive
The rule set contains as much information as the tree

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
74
Rules can be simplified
(Tree and Refund / Marital Status / Taxable Income / Cheat training data as in the previous slides.)

Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
75
Effect of rule simplification
◼ Rules are no longer mutually exclusive
◼ A record may trigger more than one rule
◼ Solution?
◼ Ordered rule set
◼ Unordered rule set – use voting schemes
◼ Rules are no longer exhaustive
◼ A record may not trigger any rules
◼ Solution?
◼ Use a default class

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
76
Ordered rule set
◼ Rules are rank ordered according to their priority
◼ An ordered rule set is known as a decision list
◼ When a test record is presented to the classifier
◼ It is assigned to the class label of the highest ranked rule it has
triggered
◼ If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
77
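A small Python sketch of a decision list (ordered rule set) with a default class; the rules are those of the animal example, encoded as attribute-value conditions. This only illustrates how an ordered rule set is applied, not a rule-learning algorithm.

```python
RULES = [  # (condition, class) pairs, in priority order
    ({"Give Birth": "no", "Can Fly": "yes"}, "Birds"),
    ({"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),
    ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),
    ({"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),
    ({"Live in Water": "sometimes"}, "Amphibians"),
]

def classify(record, rules=RULES, default="unknown"):
    """Return the class of the highest-ranked rule covering the record."""
    for condition, label in rules:
        if all(record.get(a) == v for a, v in condition.items()):
            return label
    return default   # no rule fired: assign the default class

turtle = {"Blood Type": "cold", "Give Birth": "no",
          "Can Fly": "no", "Live in Water": "sometimes"}
print(classify(turtle))   # R4 fires before R5 -> Reptiles
```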
Building classification rules
◼ Direct Method
◼ Extract rules directly from data
◼ e.g.: RIPPER, CN2, Holte’s 1R

◼ Indirect Method
◼ Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
◼ e.g: C4.5rules

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
78
Evaluation of rule based classifiers

◼ Accuracy
  ◼ Higher than decision trees
◼ Interpretability
  ◼ Model and prediction are interpretable
◼ Incrementality
  ◼ Not incremental
◼ Efficiency
  ◼ Fast model building
  ◼ Very fast classification
◼ Scalability
  ◼ Scalable both in training set size and attribute number
◼ Robustness
  ◼ Robust to outliers

DB
MG
80
Associative classification
DBMG - Data Base and Data Mining Group of Politecnico di Torino

Elena Baralis
Politecnico di Torino
Associative classification
◼ The classification model is defined by
means of association rules
(Condition) → y
◼ rule body is an itemset
◼ Model generation
◼ Rule selection & sorting
◼ based on support, confidence and correlation
thresholds
◼ Rule pruning
◼ Database coverage: the training set is covered by selecting topmost rules according to the previous sort

DB
MG
82
Evaluation of associative classifiers

◼ Accuracy
  ◼ Higher than decision trees and rule-based classifiers
  ◼ correlation among attributes is considered
◼ Interpretability
  ◼ Model and prediction are interpretable
◼ Incrementality
  ◼ Not incremental
◼ Efficiency
  ◼ Rule generation may be slow
    ◼ It depends on the support threshold
  ◼ Very fast classification
◼ Scalability
  ◼ Scalable in training set size
  ◼ Reduced scalability in attribute number
    ◼ Rule generation may become unfeasible
◼ Robustness
  ◼ Unaffected by missing data
  ◼ Robust to outliers

DB
MG
83
K-Nearest Neighbor
DBMG - Data Base and Data Mining Group of Politecnico di Torino

Elena Baralis
Politecnico di Torino
Instance-Based Classifiers
(Figure: a set of stored cases with attributes Atr1 ... AtrN and class labels A, B, C, and an unseen case with the same attributes to be classified.)

◼ Store the training records
◼ Use the training records to predict the class label of unseen cases

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
86
Instance Based Classifiers
◼ Examples
◼ Rote-learner
◼ Memorizes entire training data and performs
classification only if attributes of record match one of
the training examples exactly
◼ Nearest neighbor
◼ Uses k “closest” points (nearest neighbors) for
performing classification

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
87
Nearest-Neighbor Classifiers
Requires
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record
– Compute the distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking the majority vote)

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
88
Definition of Nearest Neighbor

(Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x.)

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
89
1 nearest-neighbor
Voronoi Diagram

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
90
Nearest Neighbor Classification
◼ Compute distance between two points
◼ Euclidean distance

d(p, q) = sqrt( Σ_i (p_i − q_i)² )

◼ Determine the class from nearest neighbor


list
◼ take the majority vote of class labels among the
k-nearest neighbors
◼ Weigh the vote according to distance
◼ weight factor, w = 1/d²
DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
91
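A minimal Python sketch of k-NN with optional distance weighting (illustrative, not the scikit-learn API); records are plain numeric tuples, the labels and sample data are made up.

```python
import math
from collections import defaultdict

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(x, training, k=3, weighted=False):
    """training is a list of (point, label) pairs."""
    neighbors = sorted(training, key=lambda pl: euclidean(x, pl[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(x, point)
        votes[label] += 1.0 / (d ** 2 + 1e-9) if weighted else 1.0   # w = 1/d^2
    return max(votes, key=votes.get)

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
print(knn_classify((1.1, 1.0), training, k=3))                 # -> "A"
print(knn_classify((1.1, 1.0), training, k=3, weighted=True))  # -> "A"
```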
Nearest Neighbor Classification
◼ Choosing the value of k:
◼ If k is too small, sensitive to noise points
◼ If k is too large, neighborhood may include points from
other classes

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
92
Nearest Neighbor Classification
◼ Scaling issues
◼ Attribute domain should be normalized to prevent
distance measures from being dominated by one
of the attributes
◼ Example: height [1.5m to 2.0m] vs. income
[$10K to $1M]
◼ Problem with distance measures
◼ High dimensional data
◼ curse of dimensionality

DB
MG
93
Evaluation of KNN

◼ Accuracy
  ◼ Comparable to other classification techniques for simple datasets
◼ Interpretability
  ◼ Model is not interpretable
  ◼ Single predictions can be "described" by the neighbors
◼ Incrementality
  ◼ Incremental
  ◼ Training set must be available
◼ Efficiency
  ◼ (Almost) no model building
  ◼ Slower classification, requires computing distances
◼ Scalability
  ◼ Weakly scalable in training set size
  ◼ Curse of dimensionality for increasing attribute number
◼ Robustness
  ◼ Depends on distance computation

DB
MG
94
Bayesian Classification
DBMG - Data Base and Data Mining Group of Politecnico di Torino

Elena Baralis
Politecnico di Torino
Bayes theorem
◼ Let C and X be random variables
P(C,X) = P(C|X) P(X)
P(C,X) = P(X|C) P(C)
◼ Hence
P(C|X) P(X) = P(X|C) P(C)
◼ and also
P(C|X) = P(X|C) P(C) / P(X)

DB
MG
96
Bayesian classification
◼ Let the class attribute and all data attributes be random
variables
◼ C = any class label
◼ X = <x1,…,xk> record to be classified
◼ Bayesian classification
◼ compute P(C|X) for all classes
◼ probability that record X belongs to C
◼ assign X to the class with maximal P(C|X)
◼ Applying Bayes theorem
P(C|X) = P(X|C)·P(C) / P(X)
◼ P(X) constant for all C, disregarded for maximum computation
◼ P(C) a priori probability of C
P(C) = Nc/N

DB
MG
97
Bayesian classification
◼ How to estimate P(X|C), i.e. P(x1,…,xk|C)?
◼ Naïve hypothesis
P(x1,…,xk|C) = P(x1|C) P(x2|C) … P(xk|C)
◼ statistical independence of attributes x1,…,xk
◼ not always true
◼ model quality may be affected
◼ Computing P(xk|C)
◼ for discrete attributes
P(xk|C) = |xkC|/ Nc
◼ where |xkC| is number of instances having value xk for attribute k
and belonging to class C
◼ for continuous attributes, use probability distribution
◼ Bayesian networks
◼ allow specifying a subset of dependencies among attributes
DB
MG
98
Bayesian classification: Example
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
DB
MG From: Han, Kamber,”Data mining; Concepts and Techniques”, Morgan Kaufmann 2006
99
Bayesian classification: Example
outlook
P(sunny|p) = 2/9 P(sunny|n) = 3/5 P(p) = 9/14
P(overcast|p) = 4/9 P(overcast|n) = 0 P(n) = 5/14
P(rain|p) = 3/9 P(rain|n) = 2/5
temperature
P(hot|p) = 2/9 P(hot|n) = 2/5
P(mild|p) = 4/9 P(mild|n) = 2/5
P(cool|p) = 3/9 P(cool|n) = 1/5
humidity
P(high|p) = 3/9 P(high|n) = 4/5
P(normal|p) = 6/9 P(normal|n) = 2/5
windy
P(true|p) = 3/9 P(true|n) = 3/5
P(false|p) = 6/9 P(false|n) = 2/5

DB
MG From: Han, Kamber,”Data mining; Concepts and Techniques”, Morgan Kaufmann 2006
100
Bayesian classification: Example
◼ Data to be labeled
X = <rain, hot, high, false>
◼ For class p
P(X|p)·P(p) =
= P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
= 3/9·2/9·3/9·6/9·9/14 = 0.010582
◼ For class n
P(X|n)·P(n) =
= P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
= 2/5·2/5·4/5·2/5·5/14 = 0.018286

DB
MG From: Han, Kamber,”Data mining; Concepts and Techniques”, Morgan Kaufmann 2006
101
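A short Python sketch reproducing the computation above (an illustrative hand-rolled Naïve Bayes over the weather table, not a library API).

```python
from collections import Counter, defaultdict

# Weather training data: (outlook, temperature, humidity, windy) -> class
data = [
    (("sunny", "hot", "high", "false"), "N"), (("sunny", "hot", "high", "true"), "N"),
    (("overcast", "hot", "high", "false"), "P"), (("rain", "mild", "high", "false"), "P"),
    (("rain", "cool", "normal", "false"), "P"), (("rain", "cool", "normal", "true"), "N"),
    (("overcast", "cool", "normal", "true"), "P"), (("sunny", "mild", "high", "false"), "N"),
    (("sunny", "cool", "normal", "false"), "P"), (("rain", "mild", "normal", "false"), "P"),
    (("sunny", "mild", "normal", "true"), "P"), (("overcast", "mild", "high", "true"), "P"),
    (("overcast", "hot", "normal", "false"), "P"), (("rain", "mild", "high", "true"), "N"),
]

priors = Counter(c for _, c in data)              # Nc for each class, P(C) = Nc / N
counts = defaultdict(Counter)                     # counts[(attribute index, class)][value]
for x, c in data:
    for i, v in enumerate(x):
        counts[(i, c)][v] += 1

def score(x, c):
    """P(X|C) * P(C) under the naive independence hypothesis."""
    p = priors[c] / len(data)
    for i, v in enumerate(x):
        p *= counts[(i, c)][v] / priors[c]        # P(xk|C) = |xkC| / Nc
    return p

x = ("rain", "hot", "high", "false")
print(round(score(x, "P"), 6), round(score(x, "N"), 6))   # 0.010582  0.018286
```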
Evaluation of Naïve Bayes Classifiers

◼ Accuracy
  ◼ Similar or lower than decision trees
  ◼ Naïve hypothesis simplifies the model
◼ Interpretability
  ◼ Model and prediction are not interpretable
  ◼ The weights of the contributions in a single prediction may be used to explain
◼ Incrementality
  ◼ Fully incremental
  ◼ Does not require availability of the training data
◼ Efficiency
  ◼ Fast model building
  ◼ Very fast classification
◼ Scalability
  ◼ Scalable both in training set size and attribute number
◼ Robustness
  ◼ Affected by attribute correlation

DB
MG
102
Support Vector Machines
DBMG - Data Base and Data Mining Group of Politecnico di Torino

Elena Baralis
Politecnico di Torino
Support Vector Machines

◼ Find a linear hyperplane (decision boundary) that will separate the


data

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
104
Support Vector Machines
B1

◼ One Possible Solution

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
105
Support Vector Machines

B2

◼ Another possible solution

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
106
Support Vector Machines

B2

◼ Other possible solutions

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
107
Support Vector Machines
B1

B2

◼ Which one is better? B1 or B2?


◼ How do you define better?

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
108
Support Vector Machines
(Figure: hyperplanes B1 and B2 with their margins, delimited by b11, b12 and b21, b22.)

◼ Find the hyperplane that maximizes the margin => B1 is better than B2

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
109
Nonlinear Support Vector Machines
◼ What if decision boundary is not linear?

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
110
Nonlinear Support Vector Machines
◼ Transform data into higher dimensional space

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
111
Evaluation of Support Vector Machines

◼ Accuracy
  ◼ Among best performers
◼ Interpretability
  ◼ Model and prediction are not interpretable
  ◼ Black box model
◼ Incrementality
  ◼ Not incremental
◼ Efficiency
  ◼ Model building requires significant parameter tuning
  ◼ Very fast classification
◼ Scalability
  ◼ Medium scalable both in training set size and attribute number
◼ Robustness
  ◼ Robust to noise and outliers

DB
MG
112
Artificial Neural Networks
DBMG - Data Base and Data Mining Group of Politecnico di Torino

Elena Baralis
Politecnico di Torino
Artificial Neural Networks
◼ Inspired by the structure of the human brain
  ◼ Neurons as elaboration units
  ◼ Synapses as connection network

DB
MG
114
Artificial Neural Networks
◼ Different tasks, different architectures
◼ image understanding: convolutional NN (CNN)
◼ time series analysis: recurrent NN (RNN)
◼ numerical vector classification: feed forward NN (FFNN)
◼ denoising: auto-encoders

DB
MG
115
Feed Forward Neural Network

(Figure: a feed-forward network mapping an input vector (xi) to an output vector; neurons are linked by weighted connections (wij).)

DB
MG
116
Structure of a neuron

(Figure: a neuron with input vector x = (x0, x1, ..., xn), weight vector w = (w0, w1, ..., wn), offset −μk, weighted sum Σ, and activation function f producing the output y.)

y = f( Σ_i w_i·x_i − μ_k )

DB
MG From: Han, Kamber,”Data mining; Concepts and Techniques”, Morgan Kaufmann 2006
118
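A tiny Python sketch of this computation (illustrative; the sigmoid activation and the sample numbers are assumptions, anticipating the activation functions discussed next).

```python
import math

def neuron_output(x, w, offset, activation=lambda v: 1.0 / (1.0 + math.exp(-v))):
    """Weighted sum of the inputs, minus the offset, passed through the activation."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) - offset
    return activation(weighted_sum)

print(neuron_output(x=[0.5, -1.0, 2.0], w=[0.4, 0.1, 0.7], offset=0.3))  # sigmoid output in (0, 1)
```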
Activation Functions
◼ Activation
◼ simulates the biological activation to input stimuli
◼ provides non-linearity to the computation
◼ may help to saturate neuron outputs in fixed ranges

DB
MG
119
Activation Functions
◼ Sigmoid, tanh
◼ saturate input value in a fixed range
◼ non linear for all the input scale
◼ typically used by FFNNs for both hidden and output layers
◼ E.g. sigmoid in output layers allows generating values between 0 and 1
(useful when output must be interpreted as likelihood)

DB
MG
120
Activation Functions
◼ Binary Step
◼ outputs 1 when the input is positive, 0 otherwise
◼ useful for binary outputs
◼ issues: not appropriate for gradient descent
◼ derivative not defined in x=0
◼ derivative equal to 0 in every other position

◼ ReLU (Rectified Linear Unit)


◼ used in deep networks (e.g. CNNs)
◼ avoids vanishing gradient
◼ does not saturate
◼ neurons activate linearly only for positive input

DB
MG
121
Activation Functions
◼ Softmax
  ◼ differently from other activation functions
    ◼ it is applied only to the output layer
    ◼ it works by considering all the neurons in the layer

◼ after softmax, the output vector can be interpreted as a discrete


distribution of probabilities
◼ e.g. the probabilities for the input pattern of belonging to each class

output layer

DB
MG
122
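A minimal sketch of the softmax computation over an output layer (the raw scores are illustrative values).

```python
import math

def softmax(z):
    """Turn raw output-layer values into a discrete probability distribution."""
    exps = [math.exp(v - max(z)) for v in z]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]          # raw outputs, one per class
print(softmax(scores))            # probabilities that sum to 1; highest for the first class
```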
Building a FFNN
◼ For each node, definition of
◼ set of weights
◼ offset value
providing the highest accuracy on the
training data
◼ Iterative approach on training data
instances

DB
MG
123
Building a FFNN
◼ Base algorithm
◼ Initially assign random values to weights and offsets
◼ Process instances in the training set one at a time
◼ For each neuron, compute the result when applying weights,
offset and activation function for the instance
◼ Forward propagation until the output is computed
◼ Compare the computed output with the expected output, and
evaluate error
◼ Backpropagation of the error, by updating weights and offset for
each neuron
◼ The process ends when
◼ % of accuracy above a given threshold
◼ % of parameter variation (error) below a given threshold
◼ The maximum number of epochs is reached

DB
MG
124
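A compact numpy sketch of this loop for a single-hidden-layer network on a toy XOR-like dataset, shown only to make forward propagation, error evaluation and backpropagation concrete; the layer sizes, learning rate, number of epochs and squared-error loss are arbitrary assumptions, not the course's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # expected outputs (XOR)

# random initial weights and offsets (biases)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
lr = 1.0

for epoch in range(10000):
    # forward propagation until the output is computed
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # error between computed and expected output (squared error)
    err = out - y
    # backpropagation of the error: update weights and offsets of each layer
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(np.round(out.ravel(), 2))   # predictions after training, ideally close to [0, 1, 1, 0]
```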
Evaluation of Feed Forward NN

◼ Accuracy
  ◼ Among best performers
◼ Interpretability
  ◼ Model and prediction are not interpretable
  ◼ Black box model
◼ Incrementality
  ◼ Not incremental
◼ Efficiency
  ◼ Model building requires very complex parameter tuning
  ◼ It requires significant time
  ◼ Very fast classification
◼ Scalability
  ◼ Medium scalable both in training set size and attribute number
◼ Robustness
  ◼ Robust to noise and outliers
  ◼ Requires a large training set
    ◼ Otherwise unstable when tuning parameters

DB
MG
126
Convolutional Neural Networks
◼ Allow automatically extracting features from images and
performing classification

(Figure: Convolutional Neural Network (CNN) architecture, mapping an input image to a predicted class with confidence.)

DB
MG
127
Convolutional Neural Networks

feature extraction

low level features abstract features

DB
MG
128
Convolutional Neural Networks

classification,
with softmax activation

feature extraction

low level features abstract features

DB
MG
129
Convolutional Neural Networks
◼ Typical convolutional layer
◼ convolution stage: feature extraction by means of (hundreds to
thousands) sliding filters
◼ sliding filters activation: apply activation functions to input tensor
◼ pooling: tensor downsampling

convolution
activation
pooling

DB
MG
130
Convolutional Neural Networks
◼ Tensors
◼ data flowing through CNN layers is represented in the form of tensors
◼ Tensor = N-dimensional vector
◼ Rank = number of dimensions
◼ scalar: rank 0
◼ 1-D vector: rank 1
◼ 2-D matrix: rank 2
◼ Shape = number of elements for each dimension
◼ e.g. a vector of length 5 has shape [5]
◼ e.g. a matrix w x h, w=5, h=3 has shape [h, w] = [3, 5]

rank-3 tensor with shape


[d,h,w] = [4,2,3]

DB
MG
131
Convolutional Neural Networks

◼ Images
◼ rank-3 tensors with shape [d,h,w]
◼ where h=height, w=width, d=image depth (1 for grayscale, 3 for RGB
colors)

DB
MG
132
Convolutional Neural Networks
◼ Convolution
◼ processes data in form of tensors (multi-dimensional matrices)
◼ input: input image or intermediate features (tensor)
◼ output: a tensor with the extracted features

pixel value

input tensor output tensor

DB
MG
133
Convolutional Neural Networks
◼ Convolution
◼ a sliding filter produces the values of the output tensor
◼ sliding filters contain the trainable weights of the neural network
◼ each convolutional layer contains many (hundreds) filters

padding

sliding filter

output tensor
input tensor

DB
MG
134
Convolutional Neural Networks
◼ Convolution
◼ images are transformed into features by convolutional filters
◼ after convolving a tensor [d,h,w] with N filters we obtain
◼ a rank-3 tensor with shape [N,h,w]

◼ hence, each filter generates a layer in the depth of the output tensor

DB
MG
135
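A small numpy sketch of this shape transformation (a naive, unoptimized convolution with zero padding; the filter values are random and purely illustrative).

```python
import numpy as np

def convolve(tensor, filters):
    """Naive 2D convolution with zero padding.
    tensor: input of shape [d, h, w]; filters: N filters of shape [N, d, fh, fw].
    Returns a tensor of shape [N, h, w] (one output layer per filter)."""
    d, h, w = tensor.shape
    N, _, fh, fw = filters.shape
    ph, pw = fh // 2, fw // 2
    padded = np.pad(tensor, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((N, h, w))
    for n in range(N):
        for i in range(h):
            for j in range(w):
                out[n, i, j] = np.sum(padded[:, i:i + fh, j:j + fw] * filters[n])
    return out

rng = np.random.default_rng(0)
image = rng.random((3, 32, 32))          # RGB image: [d, h, w] = [3, 32, 32]
filters = rng.normal(size=(8, 3, 3, 3))  # 8 sliding filters of size 3x3 over depth 3
print(convolve(image, filters).shape)    # -> (8, 32, 32), i.e. [N, h, w]
```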
Convolutional Neural Networks
◼ Activation
◼ simulates the biological activation to input stimuli
◼ provides non-linearity to the computation
◼ ReLU is typically used for CNNs
◼ faster training (no vanishing gradients)
◼ does not saturate
◼ faster computation of derivatives for backpropagation

(Plot: ReLU(x) = max(0, x).)

DB
MG
136
Convolutional Neural Networks
◼ Pooling
◼ performs tensor downsampling
◼ sliding filter which replaces tensor values with a summary statistic of the
nearby outputs
◼ maxpool is the most common: computes the maximum value as statistic

sliding filter

output tensor

DB
MG
137
Convolutional Neural Networks
◼ Convolutional layers training
◼ during training each sliding filter learns to recognize a particular
pattern in the input tensor
◼ filters in shallow layers recognize textures and edges
◼ filters in deeper layers can recognize objects and parts (e.g. eye,
ear or even faces)

deeper filters
shallow filters

DB
MG
138
Convolutional Neural Networks
◼ Semantic segmentation CNNs
◼ allow assigning a class to each pixel of the input image
◼ composed of 2 parts
◼ encoder network: convolutional layers to extract abstract features
◼ decoder network: deconvolutional layers to obtain the output image from
the extracted features

SegNet neural network

DB
MG
139
Recurrent Neural Networks
◼ Allow processing sequential data x(t)
◼ Differently from normal FFNN they are able to keep a state which
evolves during time
◼ Applications
◼ machine translation
◼ time series prediction
◼ speech recognition
◼ part of speech (POS) tagging

DB
MG
140
Recurrent Neural Networks
◼ RNN execution during time

instance of the RNN at time t1

DB
MG
141
Recurrent Neural Networks
◼ RNN execution during time

instance of the RNN at time t2

DB
MG
142
Recurrent Neural Networks
◼ RNN execution during time

instance of the RNN at time t3

DB
MG
143
Recurrent Neural Networks
◼ RNN execution during time

instance of the RNN at time t4

DB
MG
144
Recurrent Neural Networks
◼ A RNN receives as input a vector x(t) and the state at previous time
step s(t-1)
◼ A RNN typically contains many neurons organized in different layers

(Figure: internal structure of an RNN cell: the input x(t) enters through weights w, the previous state s(t-1) is fed back through weights w' (state retroaction), and the cell produces the output y(t).)

DB
MG
145
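A minimal numpy sketch of a single recurrent step under these assumptions (tanh state update; the weight shapes and random values are illustrative, and real RNN layers add biases and an output transformation).

```python
import numpy as np

def rnn_step(x_t, s_prev, W, W_rec):
    """One time step: the new state depends on the current input and the previous state."""
    s_t = np.tanh(W @ x_t + W_rec @ s_prev)    # state update
    y_t = s_t                                  # here the output is simply the state
    return y_t, s_t

rng = np.random.default_rng(0)
W, W_rec = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
state = np.zeros(4)
for x_t in rng.random((5, 3)):                 # a sequence of 5 input vectors x(t)
    y_t, state = rnn_step(x_t, state, W, W_rec)
print(y_t)                                     # output at the last time step
```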
Recurrent Neural Networks
◼ Training is performed with Backpropagation Through Time
◼ Given a pair training sequence x(t) and expected output y(t)
◼ error is propagated through time
◼ weights are updated to minimize the error across all the time steps

instance of the RNN at time t

unrolled RNN diagram: shows


the same neural network at
different time steps

DB
MG
146
Recurrent Neural Networks
◼ Issues
◼ vanishing gradient: error gradient decreases rapidly over time, weights
are not properly updated
◼ this makes harder having RNN with long-term memories

◼ Solution: LSTM (Long Short Term Memories)


◼ RNN with “gates” which encourage the state information to flow through
long time intervals

LSTM stage

DB
MG
147
Autoencoders
◼ Autoencoders allow compressing input data by means of compact
representations and from them reconstruct the initial input
◼ for feature extraction: the compressed representation can be used as significant
set of features representing input data
◼ for image (or signal) denoising: the image reconstructed from the abstract
representation is denoised with respect to the original one

noisy image reconstructed image

compressed data

DB
MG
148
Word Embeddings (Word2Vec)

◼ Word embeddings associate words to n-dimensional vectors


◼ trained on big text collections to model the word distributions in different
sentences and contexts
◼ able to capture the semantic information of each word
◼ words with similar meaning share vectors with similar characteristics

(Figure: each input word, e.g. "man" or "king", is mapped to its embedding vector, e.g. e(man) = [e1, e2, e3] and e(king) = [e1', e2', e3'].)

DB
MG
149
Word Embeddings (Word2Vec)

◼ Since each word is represented with a vector, operations among


words (e.g. difference, addition) are allowed

DB
MG
150
Word Embeddings (Word2Vec)

◼ Semantic relationships among words are captured by vector positions

king - man = queen - woman


king - man + woman = queen

DB
MG
151
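A toy Python sketch of this vector arithmetic; the 3-dimensional embeddings below are made up for illustration, while real word2vec vectors have hundreds of dimensions and are learned from large text collections.

```python
import numpy as np

# hypothetical toy embeddings, chosen so that the gender offset is consistent
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.8]),
    "man":   np.array([0.3, 0.9, 0.0]),
    "woman": np.array([0.3, 0.3, 0.7]),
}

def most_similar(vector, exclude=()):
    """Word whose embedding has the highest cosine similarity with the given vector."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in emb.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(candidates[w], vector))

analogy = emb["king"] - emb["man"] + emb["woman"]
print(most_similar(analogy, exclude={"king", "man", "woman"}))   # -> "queen"
```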
Model evaluation
DBMG - Data Base and Data Mining Group of Politecnico di Torino

Elena Baralis
Politecnico di Torino
Model evaluation
◼ Methods for performance evaluation
◼ Partitioning techniques for training and test sets
◼ Metrics for performance evaluation
◼ Accuracy, other measures
◼ Techniques for model comparison
◼ ROC curve

DB
MG
153
Methods for performance evaluation
◼ Objective
◼ reliable estimate of performance
◼ Performance of a model may depend on
other factors besides the learning algorithm
◼ Class distribution
◼ Cost of misclassification
◼ Size of training and test sets

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
154
Learning curve
◼ The learning curve shows how accuracy changes with varying training sample size
◼ Requires a sampling schedule for creating the learning curve:
  ◼ Arithmetic sampling (Langley et al.)
  ◼ Geometric sampling (Provost et al.)
◼ Effect of small sample size:
  ◼ Bias in the estimate
  ◼ Variance of the estimate

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
155
Partitioning data
◼ Several partitioning techniques
◼ holdout
◼ cross validation
◼ Stratified sampling to generate partitions
◼ without replacement
◼ Bootstrap
◼ Sampling with replacement

DB
MG
156
Methods of estimation
◼ Partitioning labeled data for training,
validation and test
◼ Several partitioning techniques
◼ holdout
◼ cross validation
◼ Stratified sampling to generate partitions
◼ without replacement
◼ Bootstrap
◼ Sampling with replacement

DB
MG
157
Holdout
◼ Fixed partitioning
◼ Typically, may reserve 80% for training, 20%
for test
◼ Other proportions may be appropriate,
depending on the dataset size
◼ Appropriate for large datasets
◼ may be repeated several times
◼ repeated holdout

DB
MG
158
Cross validation
◼ Cross validation
◼ partition data into k disjoint subsets (i.e., folds)
◼ k-fold: train on k-1 partitions, test on the
remaining one
◼ repeat for all folds
◼ reliable accuracy estimation, not appropriate for
very large datasets
◼ Leave-one-out
◼ cross validation for k=n
◼ only appropriate for very small datasets

DB
MG
159
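A short Python sketch of k-fold cross validation (illustrative only; `train` and `evaluate` are hypothetical placeholders for the chosen classifier's training and accuracy functions, and the folds here are deterministic rather than stratified).

```python
def k_fold_indices(n, k):
    """Partition record indices 0..n-1 into k disjoint folds."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(records, k, train, evaluate):
    """Train on k-1 folds, test on the remaining one, repeat for all folds."""
    folds = k_fold_indices(len(records), k)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train([records[j] for j in train_idx])            # hypothetical training function
        accuracies.append(evaluate(model, [records[j] for j in test_idx]))
    return sum(accuracies) / k                                    # average accuracy over the folds
```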
Model performance estimation
◼ Model training step
◼ Building a new model
◼ Model validation step
◼ Hyperparameter tuning
◼ Algorithm selection
◼ Model test step
◼ Estimation of model performance

DB
MG
160
Model performance estimation
◼ Typical dataset size
◼ Training set 60% of labeled data
◼ Validation set 20% of labeled data
◼ Test set 20% of labeled data
◼ Splitting labeled data
◼ Use hold-out to split in
◼ training+validation
◼ test
◼ Use cross validation to split in
◼ training
◼ validation
DB
MG
161
Metrics for model evaluation
◼ Evaluate the predictive accuracy of a model
◼ Confusion matrix
◼ binary classifier

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)

a: TP (true positive)    b: FN (false negative)
c: FP (false positive)   d: TN (true negative)

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
162
Accuracy
◼ Most widely-used metric for model
evaluation
Accuracy = (number of correctly classified objects) / (number of classified objects)

◼ Not always a reliable metric

DB
MG
163
Accuracy
◼ For a binary classifier
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
164
Limitations of accuracy
◼ Consider a binary problem
◼ Cardinality of Class 0 = 9900
◼ Cardinality of Class 1 = 100
◼ Model
() → class 0
◼ Model predicts everything to be class 0
◼ accuracy is 9900/10000 = 99.0 %
◼ Accuracy is misleading because the model
does not detect any class 1 object

DB
MG
165
Limitations of accuracy
◼ Classes may have different importance
◼ Misclassification of objects of a given class is
more important
◼ e.g., ill patients erroneously assigned to the
healthy patients class
◼ Accuracy is not appropriate for
◼ unbalanced class label distribution
◼ different class relevance

DB
MG
166
Class specific measures
◼ Evaluate separately for each class C

Recall (r) = (number of objects correctly assigned to C) / (number of objects belonging to C)

Precision (p) = (number of objects correctly assigned to C) / (number of objects assigned to C)

◼ Maximize

F-measure (F) = 2rp / (r + p)

DB
MG
167
Class specific measures
◼ For a binary classification problem
◼ on the confusion matrix, for the positive class
Precision (p) = a / (a + c)

Recall (r) = a / (a + b)

F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
168
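A small Python sketch computing these measures from the binary confusion matrix entries (a, b, c, d as defined above); the example counts are made up.

```python
def classification_measures(a, b, c, d):
    """a = TP, b = FN, c = FP, d = TN for the positive class."""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f_measure

# illustrative counts: 70 TP, 30 FN, 10 FP, 890 TN
print(classification_measures(70, 30, 10, 890))
```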
ROC (Receiver Operating Characteristic)
◼ Developed in 1950s for signal detection
theory to analyze noisy signals
◼ characterizes the trade-off between positive hits
and false alarms
◼ ROC curve plots
◼ TPR, True Positive Rate (on the y-axis)
TPR = TP/(TP+FN)
against
◼ FPR, False Positive Rate (on the x-axis)

FPR = FP/(FP + TN)

DB
MG
169
ROC curve
(FPR, TPR) points:
◼ (0,0): declare everything to be negative class
◼ (1,1): declare everything to be positive class
◼ (0,1): ideal
◼ Diagonal line
  ◼ Random guessing
◼ Below diagonal line
  ◼ prediction is opposite of the true class

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
170
How to build a ROC curve
◼ Use a classifier that produces a posterior probability P(+|A) for each test instance A
◼ Sort the instances according to P(+|A) in decreasing order
◼ Apply a threshold at each unique value of P(+|A)
◼ Count the number of TP, FP, TN, FN at each threshold
  ◼ TP rate: TPR = TP/(TP+FN)
  ◼ FP rate: FPR = FP/(FP + TN)

Instance  P(+|A)  True Class
1         0.95    +
2         0.93    +
3         0.87    -
4         0.85    -
5         0.85    -
6         0.85    +
7         0.76    -
8         0.53    +
9         0.43    -
10        0.25    +

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
171
How to build a ROC curve
True class (sorted by increasing P(+|A)):  +  -  +  -  -  -  +  -  +  +

Threshold ≥  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP           5     4     4     3     3     3     3     2     2     1     0
FP           5     5     4     4     3     2     1     1     0     0     0
TN           0     0     1     1     2     3     4     4     5     5     5
FN           0     1     1     2     2     2     2     3     3     4     5
TPR          1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR          1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: plot of the (FPR, TPR) pairs above

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
172
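A Python sketch of this construction (illustrative, not a library API); it reproduces the (FPR, TPR) values in the table above from the ten scored instances, computing one point per distinct threshold, so the three tied columns at 0.85 collapse into a single point.

```python
def roc_points(scores, labels):
    """scores: P(+|A) for each instance; labels: '+' or '-'.
    Returns (FPR, TPR) pairs obtained by thresholding at each unique score."""
    P = labels.count("+")
    N = labels.count("-")
    points = []
    thresholds = sorted(set(scores)) + [1.00 + 1e-9]      # one threshold above all scores
    for t in thresholds:
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == "+")
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == "-")
        points.append((fp / N, tp / P))                   # (FPR, TPR)
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]
for fpr, tpr in roc_points(scores, labels):
    print(fpr, tpr)
```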
Using ROC for Model Comparison
◼ No model consistently outperforms the other
  ◼ M1 is better for small FPR
  ◼ M2 is better for large FPR
◼ Area under the ROC curve
  ◼ Ideal: Area = 1.0
  ◼ Random guess: Area = 0.5

DB
MG From: Tan,Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
173
