Chapter 4 Classification
DM Task: Predictive Modeling
• A predictive model makes a prediction/forecast about the
values of new data using relationships learned from known
historical data
– Prediction Methods use existing variables to predict unknown
or future values of other variables.
• Predict one variable Y given a set of other variables X.
Here X could be an n-dimensional vector
– In effect this is function approximation through learning the
relationship between Y and X
• Many, many algorithms for predictive modeling in
statistics and machine learning, including
– Classification, regression, etc.
• Often the emphasis is on predictive accuracy, with less
emphasis on understanding the model
Prediction Problems:
Classification vs. Numeric Prediction
• Classification
– Predicts categorical class labels (discrete or nominal)
– Classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
• Numeric Prediction
– Models continuous-valued functions,
– i.e., predicts unknown or missing values
Models and Patterns
• Model = abstract representation of a given training data set
  e.g., a very simple linear model structure: Y = aX + b
  – a and b are parameters determined from the data
  – Y = aX + b is the model structure
  – Y = 0.9X + 0.3 is a particular model
• Pattern represents “local structure” in a dataset
  – E.g., if X > x then Y > y with probability p
• Example: Given a finite sample of <x, f(x)> pairs, can we create a
model that also holds for future values?

  x    f(x)
  1    1
  2    4
  3    9
  4    16
  5    ?
✓To guess the true function f, find some pattern (called a
hypothesis) in the training examples, and assume that the
pattern will hold for future examples too.
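A minimal sketch of this idea, assuming Python with NumPy: fit the linear
model structure Y = aX + b to the <x, f(x)> pairs above and use it to guess
f(5). Since the sample looks quadratic (f(x) = x^2), the linear hypothesis
extrapolates poorly, illustrating that the chosen model structure matters.

import numpy as np

# Training sample: the <x, f(x)> pairs from the table above
x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([1, 4, 9, 16], dtype=float)

# Fit the linear model structure Y = aX + b by least squares
a, b = np.polyfit(x, y, deg=1)
print(f"particular model: Y = {a:.2f}X + {b:.2f}")

# Use the fitted model as the hypothesis for the unseen point x = 5
print("predicted f(5):", a * 5 + b)   # the true value would be 25 if f(x) = x^2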
Predictive Modeling: Customer Scoring
• Example: a bank has a database of 1 million past
customers, 10% of whom took out mortgages
• Use machine learning to rank new customers as a function
of p(mortgage | customer data)
• Customer data
– History of transactions with the bank
– Other credit data (obtained from Experian, etc)
– Demographic data on the customer or where they live
• Techniques
– Binary classification: logistic regression, decision trees,
etc
– Many, many applications of this nature
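A hedged sketch of how such scoring might look, assuming Python with
scikit-learn; the column names (n_transactions, credit_score, took_mortgage)
and the data are made up for illustration.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical historical customers with a known mortgage outcome
past = pd.DataFrame({
    "n_transactions": [12, 85, 40, 3, 60, 25],
    "credit_score":   [580, 720, 650, 500, 700, 610],
    "took_mortgage":  [0, 1, 1, 0, 1, 0],
})

X, y = past[["n_transactions", "credit_score"]], past["took_mortgage"]
model = LogisticRegression().fit(X, y)

# Rank new customers by p(mortgage | customer data)
new_customers = pd.DataFrame({"n_transactions": [30, 70],
                              "credit_score":   [640, 710]})
scores = model.predict_proba(new_customers)[:, 1]
print(new_customers.assign(score=scores).sort_values("score", ascending=False))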
Classification
[Figure: scatter plot of blood cell properties against Red Blood Cell Volume,
illustrating how measured attributes separate samples into classes.]
Pattern (Association Rule) Discovery
• Goal is to discover interesting “local” patterns
(sequential patterns) in the data rather than to
characterize the data globally
– Also called link analysis (uncovers relationships
among data)
Example of Pattern Discovery
• Example in retail: from customer transactions to consumer behavior:
– People who bought “Da Vinci Code” also bought “The Five
People You Meet in Heaven” (www.amazon.com)
• Example: football player behavior
– If player A is in the game, player B’s scoring rate increases
from 25% chance per game to 95% chance per game
• What about the following?
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDC
BBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBBBBCC
ACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCAD
ADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABACB
DDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABC
CBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBC
DCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBC
BDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDD
BDDCABACBCADCDCBAAADCADDADAABBACCBB
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements or observations, the aim is to establish
the existence of classes or clusters in the data
Classification: Definition
• Classification is a data mining (machine learning) technique
used to predict group membership for data instances.
• Given a collection of records (the training set), where each record
contains a set of attributes and one of the attributes is the class.
– Find a model for class attribute as a function of the values of
other attributes.
• Goal: previously unseen records should be assigned a class as
accurately as possible. A test set is used to determine the
accuracy of the model.
– Usually, the given data set is divided into training and test sets,
with training set used to build the model and test set used to
validate it.
• For example, one may use classification to predict whether the
weather on a particular day will be “sunny”, “rainy” or “cloudy”.
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees,
or mathematical formula
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set
– If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
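A minimal sketch of the two steps, assuming Python with scikit-learn; the
dataset and the classifier used here are placeholders.

# Step 1: model construction on the training set.
# Step 2: model usage, after estimating accuracy on an independent test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)      # model construction
accuracy = accuracy_score(y_test, model.predict(X_test))    # accuracy estimation
print(f"accuracy on the independent test set: {accuracy:.2%}")

# If the accuracy is acceptable, classify tuples whose labels are not known
new_tuples = X_test[:3]          # stand-in for future/unknown objects
print(model.predict(new_tuples))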
Illustrating Classification Task
Training Set (fed to the learning algorithm to induce a model):

Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
6     No        Medium    60K       No

Test Set (the induced model is applied to predict the missing class labels):

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
15    No        Large     67K       ?
Confusion Matrix for Performance Evaluation
                        PREDICTED CLASS
                        Class=Yes     Class=No
ACTUAL     Class=Yes    a (TP)        b (FN)
CLASS      Class=No     c (FP)        d (TN)
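From these counts the standard performance measures follow; a small sketch
in plain Python with illustrative, made-up counts:

# Illustrative confusion-matrix counts (hypothetical values)
a, b, c, d = 50, 10, 5, 35   # TP, FN, FP, TN

accuracy  = (a + d) / (a + b + c + d)   # fraction of test samples classified correctly
precision = a / (a + c)                 # of the predicted Yes, how many are actually Yes
recall    = a / (a + b)                 # of the actual Yes, how many were found
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")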
K-Nearest Neighbors
• K-nearest neighbor is a supervised learning algorithm in which a new
query instance is classified based on the majority category of its
K nearest neighbors.
• The purpose of this algorithm is to classify a new object based
on attributes and training samples: (xn, f(xn)), n=1..N.
• Given a query point, we find K number of objects or (training
points) closest to the query point.
– The classification uses a majority vote among the classes of the K objects.
– The K-nearest neighbor algorithm uses this neighborhood classification
as the prediction value for the new query instance.
• K nearest neighbor algorithm is very simple. It works based on
minimum distance from the query instance to the training
samples to determine the K-nearest neighbors.
How to compute K-Nearest Neighbor (KNN)
Algorithm?
• Determine parameter K = number of nearest neighbors
• Calculate the distance between the query-instance and all
the training samples
– We can use Euclidean distance
• Sort the distances and determine the nearest neighbors (the K
smallest distances)
• Gather the category of the nearest neighbors
• Use simple majority of the category of nearest neighbors
as the prediction value of the query instance
– Any ties can be broken at random.
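A compact sketch of these steps, assuming Python with NumPy; the small
(age, income) training set below is made up for illustration, and the query
(57, 37000) anticipates the credit-rating example later in the chapter.

import numpy as np
from collections import Counter

def knn_classify(query, X_train, y_train, k=3):
    """Classify a query point by majority vote of its k nearest neighbors."""
    # Calculate the Euclidean distance from the query instance to all training samples
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Sort the distances and keep the K nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Gather their categories and take a simple majority vote
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical training samples: (age, income) -> credit label
X_train = np.array([[25, 40_000], [35, 60_000], [45, 80_000], [20, 20_000], [52, 30_000]])
y_train = np.array(["low", "high", "high", "low", "low"])
print(knn_classify(np.array([57, 37_000]), X_train, y_train, k=3))

In practice the attributes would usually be normalised first, since otherwise
income dominates the Euclidean distance.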
K Nearest Neighbors: Key issues
The key issues involved in training a KNN model include the choice of K and
the distance function used to compare instances, e.g. the Euclidean distance:

Dist(X, Y) = sqrt( Σ_{i=1..n} (X_i − Y_i)^2 )
k-Nearest Neighbors (k-NN)
▪ K-NN is an algorithm that can be used when you have a
bunch of objects that have been classified or labeled in some
way, and other similar objects that haven’t gotten
classified or labeled yet, and you want a way to
automatically label them
▪ The objects could be data scientists who have been
classified as “active” or “passive”; or people who have
been labeled as “high credit” or “low credit”; or restaurants
that have been labeled “five star,” “four star,” “three star,”
“two star,” “one star,” or if they really suck, “zero stars.”
▪ More seriously, it could be patients who have been
classified as “high cancer risk” or “low cancer risk.”
K-Nearest Neighbor(K-NN)
• K-NN is a supervised machine learning algorithm that can be used
for either classification or regression
K-Nearest Neighbor Classification (KNN)
• What if a new guy comes in who is 57 years
old and who makes $37,000? What’s his
likely credit rating label?
Distance Function Measurements
Hamming Distance
• For categorical variables, the Hamming distance can be used.
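Counting the attributes on which two instances disagree, it can be written as

D_H(X, Y) = Σ_{i=1..n} 1[X_i ≠ Y_i]

where 1[·] is 1 when the condition holds and 0 otherwise.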
K-Nearest-Neighbors
[Figure: labeled training points and an unlabeled query point C.]
What is the most likely label for C?
Decision Trees
• Decision tree learning constructs a tree in which internal nodes are simple
decision rules on one or more attributes and leaf nodes are
predicted class labels.
✓Given an instance of an object or situation, which is specified by a
set of properties, the tree returns a "yes" or "no" decision about
that instance.
[Example tree: the root tests Attribute_1; the branch for value-2 leads
directly to the leaf Class1, while the branches for value-1 and value-3 lead
to further tests on Attribute_2.]
• Information Gain
– Select the attribute with the highest information gain, i.e. the one that
creates the smallest average disorder
• First, compute the disorder using Entropy; the expected
information needed to classify objects into classes
• Second, measure the Information Gain: by how much the disorder of a
set is reduced by knowing the value of a particular attribute.
Entropy
• The Entropy measures the disorder of a set S containing a total of
n examples of which n+ are positive and n- are negative and it is
given by:
D(n+, n−) = −(n+/n)·log2(n+/n) − (n−/n)·log2(n−/n) = Entropy(S)
• Some useful properties of the Entropy:
– D(n,m) = D(m,n)
– D(0,m) = D(m,0) = 0
✓D(S)=0 means that all the examples in S have the same
class
– D(m,m) = 1
✓D(S)=1 means that half the examples in S are of one
class and half are in the opposite class
Information Gain
• The Information Gain measures the expected reduction
in entropy due to splitting on an attribute A
GAIN_split = Entropy(S) − Σ_{i=1..k} (n_i/n)·Entropy(i)

where the parent node S is split into k partitions and n_i is the number of
records in partition i.
• For example, for a set with 3 positive and 5 negative examples:
Entropy(S) = D(3+, 5−) = −(3/8)·log2(3/8) − (5/8)·log2(5/8) = 0.954
Which decision variable minimises the
disorder?
Average disorder of each candidate test attribute:

Attribute   Average Disorder
Hair        0.50
Height      0.69
Weight      0.94
Lotion      0.61
• Which decision variable maximises the Info Gain then?
• Remember it’s the one which minimises the average disorder.
✓ Gain(hair) = 0.954 - 0.50 = 0.454
✓ Gain(height) = 0.954 - 0.69 =0.264
✓ Gain(weight) = 0.954 - 0.94 =0.014
✓ Gain (lotion) = 0.954 - 0.61 =0.344
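A small sketch in plain Python that reproduces these numbers; the per-branch
counts for hair colour are read off the trees shown next (blonde: 2 sunburned
and 2 not, red: 1 and 0, brown: 0 and 3).

from math import log2

def entropy(pos, neg):
    """D(n+, n-): disorder of a set with pos positive and neg negative examples."""
    total = pos + neg
    d = 0.0
    for count in (pos, neg):
        if count:                      # 0 * log2(0) is taken as 0
            p = count / total
            d -= p * log2(p)
    return d

parent = entropy(3, 5)                 # 3 sunburned, 5 not -> 0.954
# Splitting on hair colour: blonde (2+, 2-), red (1+, 0-), brown (0+, 3-)
branches = [(2, 2), (1, 0), (0, 3)]
n = sum(p + m for p, m in branches)    # 8 examples in total
avg_disorder = sum((p + m) / n * entropy(p, m) for p, m in branches)
print(round(parent, 3), round(avg_disorder, 3), round(parent - avg_disorder, 3))
# -> 0.954 0.5 0.454, i.e. Gain(hair) as above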
The best decision tree?
Splitting first on Hair colour (is_sunburned):

Hair colour
  blonde -> Sarah, Annie (sunburned); Dana, Katie (none)   -- still mixed (?)
  red    -> Emily (sunburned)
  brown  -> Alex, Pete, John (none)

The blonde branch is still mixed, so it is split again on Lotion used
(is_sunburned):

Hair colour
  blonde -> Lotion used
              no  -> Sarah, Annie (sunburned)
              yes -> Dana, Katie (none)
  red    -> Emily (sunburned)
  brown  -> Alex, Pete, John (none)
Sunburn sufferers are ...
• You can view a decision tree as an IF-THEN-ELSE
statement which tells us whether someone will suffer
from sunburn:
if (hair-colour = “red”) then
    return (sunburned = yes)
else if (hair-colour = “blonde” and lotion-used = “no”) then
    return (sunburned = yes)
else
    return (sunburned = no)
Exercise: Decision Tree for “buy computer or not”. Use the
training Dataset given below to construct decision tree
age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no
Output: A Decision Tree for “buys_computer”
age?
  <=30    -> student?        no -> no,        yes -> yes
  31..40  -> yes
  >40     -> credit_rating?  excellent -> no, fair -> yes
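A hedged sketch of how such a tree could be induced with scikit-learn,
assuming Python; categorical attributes are one-hot encoded since
scikit-learn trees need numeric input, so the learned binary tree is
equivalent in spirit but not identical to the multiway tree above.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The buys_computer training set from the exercise above
data = pd.DataFrame({
    "age":           ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
                      "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"],
    "income":        ["high", "high", "high", "medium", "low", "low", "low",
                      "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "student":       ["no", "no", "no", "no", "yes", "yes", "yes",
                      "no", "yes", "yes", "yes", "no", "yes", "no"],
    "credit_rating": ["fair", "excellent", "fair", "fair", "fair", "excellent", "excellent",
                      "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"],
    "buys_computer": ["no", "no", "yes", "yes", "yes", "no", "yes",
                      "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

X = pd.get_dummies(data.drop(columns="buys_computer"))   # one-hot encode the categories
y = data["buys_computer"]
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))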
Why decision tree induction in DM?
• Relatively faster learning speed (than other classification methods)
• Convertible to simple and easy to understand classification if-then-
else rules
• Comparable classification accuracy with other methods
• Does not require any prior knowledge of data distribution, works
well on noisy data.
Pros:
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features

Cons:
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
Bayesian Learning
Why Bayesian Classification?
• Provides practical learning algorithms
– Probabilistic learning: Calculate explicit probabilities
for hypothesis. E.g. Naïve Bayes
• Prior knowledge and observed data can be combined
– Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct.
• It is a generative (model based) approach, which offers a
useful conceptual framework
– Probabilistic prediction: Predict multiple hypotheses, weighted
by their probabilities. E.g. sequences could also be classified,
based on a probabilistic model specification
– Any kind of objects can be classified, based on a probabilistic
model specification
CONDITIONAL PROBABILITY
• Probability: How likely is it that an event will happen?
• Sample Space S
– Events A and C are subsets of S
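For events A and C with P(C) > 0, the conditional probability of A given C
is the standard definition

P(A | C) = P(A ∩ C) / P(C)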
Conditional probability estimates from the play-tennis training data
(P = P(value | Play = yes), N = P(value | Play = no)):

Outlook      P     N        Humidity   P     N
sunny        2/9   3/5      high       3/9   4/5
overcast     4/9   0        normal     6/9   1/5
rain         3/9   2/5

Temperature  P     N        Windy      P     N
hot          2/9   2/5      Strong     3/9   3/5
mild         4/9   2/5      Weak       6/9   2/5
cool         3/9   1/5
Play-tennis example
• Based on the model created, predict Play Tennis or Not for the
following unseen sample
(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
c_NB = argmax_{C ∈ {yes, no}} P(C) · Π_t P(a_t | C)
     = argmax_{C ∈ {yes, no}} P(C)·P(Outlook=sunny|C)·P(Temp=cool|C)·P(Hum=high|C)·P(Wind=strong|C)
• Working:
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
answer: PlayTennis = no
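A short sketch in plain Python that reproduces the two scores from the tables
above, assuming the usual 14-example play-tennis data, i.e. P(yes) = 9/14 and
P(no) = 5/14.

# Class priors times the conditional probabilities read from the tables above
p_yes = 9 / 14 * (2 / 9) * (3 / 9) * (3 / 9) * (3 / 9)   # sunny, cool, high, strong | yes
p_no  = 5 / 14 * (3 / 5) * (1 / 5) * (4 / 5) * (3 / 5)   # sunny, cool, high, strong | no

print(round(p_yes, 4), round(p_no, 4))        # -> 0.0053 0.0206
print("PlayTennis =", "yes" if p_yes > p_no else "no")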
Brain vs. Machine
• The Brain
– Pattern Recognition
– Association
– Complexity
– Noise Tolerance
• The Machine
– Calculation
– Precision
– Logic
Features of the Brain
• Ten billion (10^10) neurons
✓ Neuron switching time ≈ 10^-3 secs
• Face Recognition ~0.1secs
• On average, each neuron has several thousand
connections
• Hundreds of operations per second
• High degree of parallel computation
• Distributed representations
• Die off frequently (never replaced)
• Compensated for problems by massive parallelism
Neural Network
• The output uses a threshold activation (consistent with the worked
example below, where the bias input is −1):
y = 1 if Σ_j W_j·x_j > 0, and y = 0 otherwise
– Update the weights according to the error:
W_j = W_j + η·(y_T − y)·x_j
where y_T is the target output and η is the learning rate
ANN Training Example
• Given the following two inputs x1 and x2, find weights that draw a
boundary separating the two classes:

Bias   x1   x2   Target output
-1     0    0    0
-1     1    0    0
-1     0    1    1
-1     1    1    1

• Let’s say we have the following initializations:
W1(0) = 0.92, W2(0) = 0.62, W0(0) = 0.22, η = 0.1
(an X below marks a misclassified sample, after which the weights are updated)
• Training – epoch 1:
y1 = 0.92*0 + 0.62*0 – 0.22 = -0.22 → y = 0
y2 = 0.92*1 + 0.62*0 – 0.22 = 0.7 → y =1 X
W1(1) = 0.92 + 0.1 * (0 – 1) * 1 = 0.82
W2(1) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W0(1) = 0.22 + 0.1 * (0 – 1) * (-1)= 0.32
y3 = 0.82*0 + 0.62*1 – 0.32 = 0.3 → y = 1
y4 = 0.82*1 + 0.62*1 – 0.32 = 1.12 → y =1
ANN Training Example
• Training – epoch 2:
y1 = 0.82*0 + 0.62*0 – 0.32 = -0.32 → y= 0
y2 = 0.82*1 + 0.62*0 – 0.32 = 0.5 → y= 1 X
W1(2) = 0.82 + 0.1 * (0 – 1) * 1 = 0.72
W2(2) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W0(2) = 0.32 + 0.1 * (0 – 1) * (-1)= 0.42
y3 = 0.72*0 + 0.62*1 – 0.42 = 0.2 → y= 1
y4 = 0.72*1 + 0.62*1 – 0.42 = 0.92 → y = 1
• Training – epoch 3:
y1 = 0.72*0 + 0.62*0 – 0.42 = -0.42 → y = 0
y2 = 0.72*1 + 0.62*0 – 0.42 = 0.3 → y = 1 X
W1(3) = 0.72 + 0.1 * (0 – 1) * 1 = 0.62
W2(3) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W0(3) = 0.42 + 0.1 * (0 – 1) * (-1)= 0.52
y3 = 0.62*0 + 0.62*1 – 0.52 = 0.1→ y = 1
y4 = 0.62*1 + 0.62*1 – 0.52 = 0.72→ y = 1
ANN Training Example
• Training – epoch 4:
y1 = 0.62*0 + 0.62*0 – 0.52 = -0.52 → y = 0
y2 = 0.62*1 + 0.62*0 – 0.52 = 0.10→ y = 1 X
W1(4) = 0.62 + 0.1 * (0 – 1) * 1 = 0.52
W2(4) = 0.62 + 0.1 * (0 – 1) * 0 = 0.62
W0(4) = 0.52 + 0.1 * (0 – 1) * (-1)= 0.62
y3 = 0.52*0 + 0.62*1 – 0.62 = 0 → y = 0 X
W1(4) = 0.52 + 0.1 * (1 – 0) * 0 = 0.52
W2(4) = 0.62 + 0.1 * (1 – 0) * 1 = 0.72
W0(4) = 0.62 + 0.1 * (1 – 0) * (-1)= 0.52
y4 = 0.52*1 + 0.72*1 – 0.52 = 0.72 → y = 1
• Finally:
y1 = 0.52*0 + 0.72*0 – 0.52 = -0.52 → y = 0
y2 = 0.52*1 + 0.72*0 – 0.52 = -0.0 → y = 0
y3 = 0.52*0 + 0.72*1 – 0.52 = 0.2 → y= 1
y4 = 0.52*1 + 0.72*1 – 0.52 = 0.72 → y= 1
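A compact sketch reproducing this training run in plain Python; the initial
weights, learning rate, and update rule are as above, and the weights printed
after each epoch should match the hand calculation.

# Perceptron training on the four samples above: inputs (x1, x2), bias input -1
samples = [((0, 0), 0), ((1, 0), 0), ((0, 1), 1), ((1, 1), 1)]
w1, w2, w0 = 0.92, 0.62, 0.22      # initial weights (w0 is the bias weight)
eta = 0.1                          # learning rate

for epoch in range(1, 6):
    errors = 0
    for (x1, x2), target in samples:
        net = w1 * x1 + w2 * x2 - w0       # bias input is -1
        y = 1 if net > 0 else 0            # threshold activation
        if y != target:                    # misclassified: update the weights
            w1 += eta * (target - y) * x1
            w2 += eta * (target - y) * x2
            w0 += eta * (target - y) * (-1)
            errors += 1
    print(f"epoch {epoch}: w1={w1:.2f} w2={w2:.2f} w0={w0:.2f} errors={errors}")
    if errors == 0:                        # converged: all samples classified correctly
        break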
ANN Training Example
[Figure: the four training points in the (x1, x2) plane, class 0 shown as “o”
and class 1 as “+”, together with the decision boundary learned by the
perceptron.]
Pros and Cons of Neural Network
• Useful for learning complex data like handwriting, speech and
image recognition
Pros:
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features

Cons:
- Slow training time
- Hard to interpret and understand the learned function (weights)
- Hard to implement: trial and error for choosing the number of nodes