Unit 4- Classification and Prediction
Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification based on concepts from association
rule mining
Other Classification Methods
Prediction
Classification accuracy
Summary
Classification vs. Prediction
Classification:
- predicts categorical class labels (discrete or nominal)
- constructs a model from the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
Prediction:
- models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications:
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis
A small code sketch contrasting the two tasks follows.
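The sketch below is illustrative only: the tiny credit dataset and feature choices are invented, and scikit-learn's DecisionTreeClassifier and LinearRegression stand in for any classifier and any numeric predictor.

```python
# Classification vs. prediction: a minimal sketch (invented data).
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a categorical label ("approve"/"reject").
X_train = [[25, 30000], [40, 80000], [35, 60000], [22, 20000]]  # [age, income]
y_labels = ["reject", "approve", "approve", "reject"]
clf = DecisionTreeClassifier().fit(X_train, y_labels)
print(clf.predict([[30, 70000]]))   # -> a class label, e.g. ['approve']

# Prediction: model a continuous-valued function (e.g. a credit limit).
y_values = [1000.0, 9000.0, 6000.0, 500.0]
reg = LinearRegression().fit(X_train, y_values)
print(reg.predict([[30, 70000]]))   # -> a continuous value
```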
Classification—A Two-Step Process
1. Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage: classifying future or unknown objects
- The known label of each test sample is compared with the model's prediction to estimate accuracy; if the accuracy is acceptable, the model is applied to unseen data

[Figure: training data are fed to a classification algorithm, which constructs a classifier; the classifier is then evaluated on testing data and applied to unseen data, e.g. (Jeff, Professor, 4) → Tenured?]

Testing data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Supervised vs. Unsupervised Learning
Supervised learning (classification):
- Supervision: the training data are accompanied by labels indicating the class of the observations
- New data are classified based on the training set
Unsupervised learning (clustering):
- The class labels of the training data are unknown
- Given a set of measurements or observations, the aim is to establish the existence of classes or clusters in the data
Issues Regarding Classification and Prediction: 1. Data Preparation
The issues are:
1. Data preparation: helps improve the efficiency, accuracy, and scalability of classification and prediction
2. Comparing classification methods
Data preparation involves:
A. Data cleaning: preprocess the data in order to reduce noise and handle missing values
B. Relevance analysis: remove irrelevant or redundant attributes
C. Data transformation: generalize and/or normalize the data; concept hierarchies can be used (a normalization sketch follows)
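As an illustration of the data transformation step, here is a minimal sketch of min-max normalization, one common way to normalize an attribute; the function name and the income values are our own.

```python
# Min-max normalization: rescale values linearly into [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

incomes = [20000, 35000, 60000, 80000]          # invented attribute values
print(min_max_normalize(incomes))               # [0.0, 0.25, 0.666..., 1.0]
```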
Issues Regarding Classification and Prediction: 2. Comparing Classification Methods
1. Predictive accuracy: the ability of the model to correctly predict the class label of new/unseen data
2. Speed: time to construct the model; time to use the model
3. Robustness: handling noise and missing values
4. Scalability: efficiency of the model for large databases
5. Interpretability: understanding the insight provided by the model
6. Goodness of rules: decision tree size; compactness of classification rules
Classification by Decision Tree Induction
Training Dataset
This example follows Quinlan's ID3.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Output: A Decision Tree for “buys_computer”

[Figure: the decision tree induced from the training data above]

age?
  <=30:  student?
           no:  no
           yes: yes
  31…40: yes
  >40:   credit_rating?
           excellent: no
           fair:      yes
Basic algorithm for inducing a decision tree from training tuples
The basic algorithm is a greedy algorithm that constructs the tree in a top-down, recursive, divide-and-conquer manner:
- At the start, all the training tuples are at the root
- The attribute that best separates the tuples into classes (e.g., by information gain) is selected as the test attribute for the node
- Tuples are partitioned recursively according to the selected attribute's values
- Partitioning stops when all tuples at a node belong to the same class, when no attributes remain (the leaf is labeled by majority voting), or when no tuples are left
A minimal Python sketch of this procedure follows.
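The sketch below assumes tuples are represented as dictionaries mapping attribute names to values; the helper names (info, gain, build_tree) are our own, and information gain (defined on the next slides) is used as the selection measure.

```python
import math
from collections import Counter

def info(labels):
    """Expected information I(s1, ..., sm) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain from splitting (rows, labels) on attribute attr."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    expected = sum(len(p) / n * info(p) for p in partitions.values())
    return info(labels) - expected

def build_tree(rows, labels, attrs):
    """Greedy top-down, recursive, divide-and-conquer tree induction."""
    if len(set(labels)) == 1:                  # all tuples in one class
        return labels[0]
    if not attrs:                              # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    branches = {}
    for value in {row[best] for row in rows}:  # one branch per attribute value
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        branches[value] = build_tree(sub_rows, sub_labels,
                                     [a for a in attrs if a != best])
    return (best, branches)

# Tiny demo on two invented tuples.
rows = [{"age": "<=30", "student": "no"}, {"age": ">40", "student": "yes"}]
labels = ["no", "yes"]
print(build_tree(rows, labels, ["age", "student"]))
```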
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain. Let S contain $s_i$ tuples of class $C_i$ for $i = 1, \ldots, m$. The expected information needed to classify an arbitrary tuple is

$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$

The entropy of an attribute A with values $\{a_1, \ldots, a_v\}$ is

$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj})$

and the information gained by branching on A is

$\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$
Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes” (p = 9 tuples); Class N: buys_computer = “no” (n = 5 tuples). Hence

$I(p, n) = I(9, 5) = 0.940$

Compute the entropy for age:

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

$E(\mathrm{age}) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

Here $\frac{5}{14} I(2,3)$ means “age <= 30” covers 5 of the 14 samples, with 2 yes's and 3 no's. Hence

$\mathrm{Gain}(\mathrm{age}) = I(p, n) - E(\mathrm{age}) = 0.940 - 0.694 = 0.246$

Similarly,

$\mathrm{Gain}(\mathrm{income}) = 0.029$
$\mathrm{Gain}(\mathrm{student}) = 0.151$
$\mathrm{Gain}(\mathrm{credit\_rating}) = 0.048$

Since age has the highest information gain, it becomes the test attribute at the root.
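A few lines of Python reproduce this arithmetic as a sanity check (the helper I() is our own shorthand for the information measure defined above):

```python
import math

def I(*counts):
    """I(s1, ..., sm) from the class counts in a partition."""
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

print(round(I(9, 5), 3))                               # 0.940
E_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(round(E_age, 3))                                 # 0.694
print(round(I(9, 5) - E_age, 3))                       # 0.246
```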
Tree Pruning: Avoid Overfitting in Classification
Overfitting: an induced tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- The result is poor accuracy on unseen samples
Two approaches avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if the goodness measure would fall below a threshold
- Postpruning: remove branches from a “fully grown” tree
Approaches to Determine the Final Tree Size
- Separate training and test sets, or use cross-validation
- Use a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node improves the result
- Use the minimum description length (MDL) principle: halt growth when the encoding is minimized
Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication
Classification in Large Databases
Classification—a classical problem extensively studied by
statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
relatively faster learning speed (than other
classification methods)
convertible to simple and easy to understand
classification rules
can use SQL queries for accessing databases
Scalable Decision Tree Induction Methods
in Data Mining Studies
SLIQ
builds an index for each attribute and only class list and
the current attribute list reside in memory
SPRINT
constructs an attribute list data structure
PUBLIC
integrates tree splitting and tree pruning: stop growing the
tree earlier
RainForest
separates the scalability aspects from the criteria that
determine the quality of the tree
builds an AVC-list (attribute, value, class label)
Bayesian Classification
Bayesian classifiers are statistical classifiers: they can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. They are based on Bayes' theorem.
Bayesian Theorem Basics
Let
- X be a data sample whose class label is unknown
- H be the hypothesis that X belongs to class C
Given the training data X, the posterior probability of the hypothesis H, P(H|X), follows Bayes' theorem:

$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$

Informally, this can be written as: posterior = likelihood × prior / evidence
Naive Bayesian Classification
Suppose
1. Each data sample is X = (x1, x2, …, xn), depicting n measurements on n attributes A1, A2, …, An respectively
2. There are m classes, C1, C2, …, Cm
By Bayes' theorem,

$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$

X is assigned to the class with the highest posterior. Assuming class conditional independence, the probability of X conditioned on Ci is given by

$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$
Training dataset
Classes:
- C1: buys_computer = ‘yes’
- C2: buys_computer = ‘no’
Data sample (unknown class):
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
The training data are the same 14 tuples shown in the decision tree training dataset above.
Naïve Bayesian Classifier: Example
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Priors:
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357

Compute P(X|Ci) for each class:
P(age <= 30 | yes) = 2/9 = 0.222            P(age <= 30 | no) = 3/5 = 0.600
P(income = medium | yes) = 4/9 = 0.444      P(income = medium | no) = 2/5 = 0.400
P(student = yes | yes) = 6/9 = 0.667        P(student = yes | no) = 1/5 = 0.200
P(credit_rating = fair | yes) = 6/9 = 0.667 P(credit_rating = fair | no) = 2/5 = 0.400

P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | no)  = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

P(X | yes) P(yes) = 0.044 × 0.643 = 0.028
P(X | no)  P(no)  = 0.019 × 0.357 = 0.007

Since 0.028 > 0.007, X is classified as buys_computer = “yes”.
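The same computation can be scripted directly from the training table. This is a sketch, not a library implementation: the data layout and loop structure are our own, and no smoothing is applied.

```python
# Naive Bayes by counting, for X = (age<=30, medium, student, fair).
data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")

for c in ("yes", "no"):
    rows = [r for r in data if r[4] == c]
    prior = len(rows) / len(data)              # P(Ci)
    likelihood = 1.0
    for k, xk in enumerate(x):                 # P(xk|Ci), independence assumed
        likelihood *= sum(r[k] == xk for r in rows) / len(rows)
    print(c, round(likelihood, 3), round(likelihood * prior, 3))
# yes 0.044 0.028 ; no 0.019 0.007  -> classify X as buys_computer = "yes"
```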
Naïve Bayesian Classifier: Comments
Advantages:
1. Easy to implement
2. Good results obtained in most cases
Disadvantages:
1. The class conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables
2. E.g., in hospital data, patient profile attributes (age, family history), symptoms (fever, cough), and diseases (lung cancer, diabetes) are interdependent; such dependencies cannot be modeled by a naïve Bayesian classifier
Bayesian Networks
A Bayesian belief network is a graphical model of causal relationships: it represents dependencies among variables and gives a specification of the joint probability distribution.
- Nodes: random variables
- Links: dependency (in the figure, Y is the parent of P)
- No dependency between Z and P
- The graph has no loops or cycles
Bayesian Belief Network: An Example
[Figure: FamilyHistory and Smoker are the parents of LungCancer; LungCancer points to PositiveXRay and Dyspnoea]

The conditional probability table (CPT) for LungCancer (LC):

      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9

A belief network specifies the joint probability as

$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid \mathrm{Parents}(Z_i))$

where the tuple (z1, …, zn) corresponds to the attributes Z1, Z2, …, Zn.
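A small sketch shows how the factored joint is evaluated. Only the LungCancer CPT comes from the slide; the priors for FamilyHistory and Smoker below are invented purely for illustration.

```python
# Joint probability as a product of each node given its parents.
p_fh = {True: 0.3, False: 0.7}       # hypothetical P(FamilyHistory)
p_s  = {True: 0.4, False: 0.6}       # hypothetical P(Smoker)
p_lc = {                             # P(LungCancer=yes | FH, S), from the CPT
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S)."""
    p_lc_given = p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * (p_lc_given if lc else 1 - p_lc_given)

print(joint(True, True, True))       # 0.3 * 0.4 * 0.8 = 0.096
```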
Learning Bayesian Networks
Several cases:
1. Given both the network structure and all variables observable: learn only the CPTs
2. Network structure known, some variables hidden: train the CPT entries by a gradient-descent method analogous to neural network learning
3. Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
4. Structure unknown and variables hidden: no good algorithms are known for this case
Classification by Neural Networks (Backpropagation)
“What is backpropagation?” It is a neural network learning algorithm.
Neural network: a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to predict the correct class labels of the input samples.
Multilayer Feed-forward Neural Network
A training sample, X = (x1, x2, …, xn), is fed to the input layer. The weighted outputs of these units are, in turn, fed simultaneously to a second layer, known as a hidden layer. The hidden layer's weighted outputs can be input to another hidden layer, and so on. The number of hidden layers is arbitrary, although in practice usually only one is used. The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction for the given samples.
Weighted connections exist between the layers, where w_ij denotes the weight on the connection from a unit i in one layer to a unit j in the next layer.
Benefits and Drawbacks of Neural Networks
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
Benefits:
- High tolerance to noisy data
- Ability to classify patterns on which they have not been trained
Disadvantages:
- Long training times, so they are better suited to applications where this is feasible
- The network topology or “structure” must be determined in advance
- Poor interpretability: it is difficult for humans to interpret the symbolic meaning behind the learned weights
Defining a network topology
Before training can begin, the user must decide
on the network topology by specifying
1. the number of units in the input layer,
2. the number of hidden layers (if more than one),
3. the number of units in each hidden layer, and
4. the number of units in the output layer.
Multi-Layer Perceptron
[Figure: input vector (x1, x2, …, xn) feeding the input nodes, weighted connections w_ij into the hidden nodes, then the output nodes emitting the output vector]

Given a unit j in a hidden or output layer, the net input I_j to unit j is

$I_j = \sum_i w_{ij} O_i + \theta_j$

where O_i is the output of unit i in the previous layer and θ_j is the bias (threshold) of unit j. Given the net input I_j, the output O_j of unit j is computed with the sigmoid activation function:

$O_j = \frac{1}{1 + e^{-I_j}}$

Backpropagate the error. The error is propagated backwards by updating the weights and biases to reflect the error of the network's prediction. For a unit j in the output layer, with true output T_j:

$Err_j = O_j (1 - O_j)(T_j - O_j)$

For a unit j in a hidden layer, where Err_k is the error of unit k in the next layer:

$Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}$

The weights and biases are then updated with learning rate l:

$w_{ij} = w_{ij} + l \cdot Err_j \cdot O_i$
$\theta_j = \theta_j + l \cdot Err_j$
Network Training
The ultimate objective of training is to obtain a set of weights that makes almost all of the tuples in the training data classified correctly.
Steps:
1. Initialize the weights with random values
2. Compute the net input to each unit as a linear combination of all the inputs to the unit
3. Compute the output value using the activation function
4. Compute the error
5. Update the weights and the bias
A small code sketch of these steps appears below.
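The following NumPy sketch implements these steps for a single hidden layer. The XOR data, layer sizes, learning rate, and epoch count are our own choices for illustration; with most random seeds the outputs converge close to the targets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy XOR data, chosen only to exercise the algorithm.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)    # true outputs Tj

# Step 1: initialize weights and biases with random values.
W1, b1 = rng.normal(0, 1.0, (2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(0, 1.0, (4, 1)), np.zeros(1)   # hidden -> output
l = 0.5                                            # learning rate

def sigmoid(I):
    """Activation: Oj = 1 / (1 + e^(-Ij))."""
    return 1.0 / (1.0 + np.exp(-I))

for epoch in range(10000):
    # Steps 2-3: net input Ij = sum_i wij*Oi + theta_j, then output Oj.
    O1 = sigmoid(X @ W1 + b1)          # hidden-layer outputs
    O2 = sigmoid(O1 @ W2 + b2)         # output-layer outputs
    # Step 4: errors. Output units: Errj = Oj(1-Oj)(Tj-Oj);
    # hidden units: Errj = Oj(1-Oj) * sum_k Errk*wjk.
    err2 = O2 * (1 - O2) * (T - O2)
    err1 = O1 * (1 - O1) * (err2 @ W2.T)
    # Step 5: update weights and biases: wij += l*Errj*Oi, theta_j += l*Errj.
    W2 += l * O1.T @ err2
    b2 += l * err2.sum(axis=0)
    W1 += l * X.T @ err1
    b1 += l * err1.sum(axis=0)

print(O2.ravel().round(2))             # should approach [0, 1, 1, 0]
```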
Other Classification Methods
- k-nearest neighbor classifier
- case-based reasoning
- genetic algorithms
- rough set approach
- fuzzy set approaches
Instance-Based (Lazy Learners) Methods
Instance-based learning: store the training examples and delay processing (“lazy evaluation”) until a new instance must be classified.
Typical approaches:
- k-nearest neighbor approach: instances are represented as points in a Euclidean space
- Case-based reasoning: uses symbolic representations and knowledge-based inference
Remarks on Lazy vs. Eager Learning
k-nearest neighbor and case-based reasoning use lazy evaluation; decision tree induction and Bayesian classification use eager evaluation.
Key differences:
- Lazy learning spends less time training but more time predicting, and can take the query instance into account when generalizing
- Eager learning must commit to a single global hypothesis that covers the entire instance space before any query arrives
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-dimensional space
- The nearest neighbors are defined in terms of Euclidean distance
- A new sample is assigned the most common class among its k nearest training samples (majority vote); a sketch follows
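A minimal sketch of the algorithm, assuming numeric feature vectors and majority voting; the function name and the toy points are our own.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k=3):
    """Majority vote among the k training points nearest (Euclidean) to query."""
    dists = sorted(
        (math.dist(query, p), lab) for p, lab in zip(points, labels)
    )
    k_labels = [lab for _, lab in dists[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Invented 2-D points for illustration.
pts  = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
labs = ["A", "A", "A", "B", "B", "B"]
print(knn_classify((2, 2), pts, labs, k=3))   # 'A'
```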
Case-Based Reasoning
- Uses lazy evaluation together with analysis of similar instances
- Unlike k-NN, instances are not “points in a Euclidean space”
- Methodology: instances are represented by rich symbolic descriptions, or as cases; when a new case arrives, earlier similar cases are retrieved, and multiple retrieved cases may be combined to propose a solution
- Research issues: indexing based on syntactic similarity measures, and, when retrieval fails, backtracking and adapting to additional cases
Fuzzy Set Approaches
- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership in a set
- For a given new sample, more than one fuzzy value may apply
The k-Nearest Neighbor Algorithm
The k-nearest-neighbor method was first described in the early 1950s. Because the method is labor intensive when given large training sets, it did not gain popularity until the 1960s, when increased computing power became available.
What Is Prediction?
Prediction is similar to classification: it first constructs a model and then uses the model to predict unknown or missing values. The major difference is that prediction models continuous-valued functions rather than categorical labels; the most widely used approach is regression.
Regression Analysis and Log-Linear Models in Prediction
Linear regression: Y = α + βX
- The two parameters, α and β, specify the line and are estimated from the data at hand, e.g., by applying the least squares criterion to the known values Y1, Y2, …, X1, X2, … (a sketch follows)
Multiple regression: Y = b0 + b1 X1 + b2 X2
Log-linear models: approximate discrete multidimensional probability distributions
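A minimal sketch of estimating α and β by least squares, on invented data:

```python
# Least-squares fit of Y = alpha + beta * X (invented data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
       / sum((x - x_bar) ** 2 for x in xs)
alpha = y_bar - beta * x_bar
print(alpha, beta)        # approximately alpha ~ 0.09, beta ~ 1.99
```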
Nonlinear models:
- Polynomial regression can be modeled by adding polynomial terms to the basic linear model
- By applying transformations to the variables, a nonlinear model can be converted into a linear one and then solved by the method of least squares (a sketch follows)
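As a sketch of that transformation, the quadratic model Y = a + bX + cX² can be fit by ordinary linear least squares once X and X² are treated as separate predictors (data invented):

```python
import numpy as np

# Polynomial regression as transformed linear regression.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.8, 5.1, 10.2, 17.0])      # roughly y = 1 + x^2

A = np.column_stack([np.ones_like(x), x, x ** 2])  # design matrix [1, X, X^2]
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)   # approximately [a, b, c] with c close to 1
```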
Summary
- Classification predicts categorical class labels; prediction models continuous-valued functions
- Methods covered include decision tree induction, Bayesian classification and belief networks, backpropagation neural networks, k-nearest neighbor, case-based reasoning, and regression
- Classification methods can be compared on predictive accuracy, speed, robustness, scalability, interpretability, and goodness of rules