19 Classification 1
Data Mining
Chapter 6: Classification
Introduction
Part 2:
• Decision Trees (ID3, C4.5) - descriptive
• Neural Networks - statistical
• Bayesian Networks - statistical
• Rough Sets - descriptive
• Genetic Algorithms - descriptive or statistical, but mainly an optimization method
Classification Data 2
(with objects)
rec   Age     Income  Student  Credit_rating  Buys_computer
r1    <=30    high    no       fair           no
r2    <=30    high    no       excellent      no
r3    31…40   high    no       fair           yes
r4    >40     medium  no       fair           yes
r5    >40     low     yes      fair           yes
r6    >40     low     yes      excellent      no
r7    31…40   low     yes      excellent      yes
r8    <=30    medium  no       fair           no
r9    <=30    low     yes      fair           yes
r10   >40     medium  yes      fair           yes
r11   <=30    medium  yes      excellent      yes
r12   31…40   medium  no       excellent      yes
r13   31…40   high    yes      fair           yes
r14   >40     medium  no       excellent      no
REMARK
A class C can have many characteristics, i.e., many characteristic
descriptions
buys_computer= yes
buys_computer= no
• The formula
Age<=30 & Income=low
is NOT a characteristic description
of the class C2 = {r: buys_computer=no}
because:
{r: Age<=30 & Income=low} ∩ {r: buys_computer=no} = ∅
(the only record with Age<=30 & Income=low is r9, and r9 has buys_computer=yes)
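The check above amounts to a set intersection. The following is a minimal Python sketch, assuming the 14-record buys_computer table is keyed r1 to r14 in the order listed; the names `select` and `is_characteristic_description` are hypothetical helpers introduced only for this illustration.

```python
# Records r1..r14: (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
FIELDS = ("age", "income", "student", "credit_rating", "buys_computer")
DATA = {f"r{i}": dict(zip(FIELDS, row)) for i, row in enumerate(ROWS, start=1)}


def select(**conditions):
    """Return the ids of all records that satisfy every attribute=value condition."""
    return {
        rid
        for rid, rec in DATA.items()
        if all(rec[attr] == val for attr, val in conditions.items())
    }


def is_characteristic_description(description, cls):
    """True when at least one record satisfies both the description and the class."""
    return bool(select(**description) & select(**cls))


c2 = {"buys_computer": "no"}
print(is_characteristic_description({"age": "<=30", "income": "low"}, c2))   # False
print(is_characteristic_description({"age": "<=30", "income": "high"}, c2))  # True
```

The second call is an extra example not taken from the slides: records r1 and r2 satisfy Age<=30 & Income=high and both belong to C2, so that formula is a characteristic description of C2.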
Characteristic Formula
Any formula of the form
IF class THEN characteristics
is called a characteristic formula
Characteristic Rule
EXAMPLE: given classification Data 1, 2
The formula
• IF buys_computer=no THEN income=low &
student=yes & credit=excellent
is a characteristic rule for our database because
{r: buys_computer=no} = {r1, r2, r6, r8, r14}
{r: income=low & student=yes &
credit=excellent} = {r6, r7}
and
{r1, r2, r6, r8, r14} ∩ {r6, r7} = {r6} ≠ ∅
Characteristic Rule
EXAMPLE: given classification Data 1, 2
The formula
• IF buys_computer=no THEN income=low &
credit=fair
IS NOT a characteristic rule for our database because
{r: buys_computer=no} = {r1, r2, r6, r8, r14}
{r: income=low & credit=fair} = {r5, r9}
and
{r1, r2, r6, r8, r14} ∩ {r5, r9} = ∅
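The two examples above can be verified mechanically. This is a minimal Python sketch, assuming the 14 records are numbered r1 to r14 in table order; `records` and `is_characteristic_rule` are hypothetical names chosen for this illustration.

```python
# The 14 records r1..r14: (age, income, student, credit, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
FIELDS = ("age", "income", "student", "credit", "buys_computer")


def records(**conditions):
    """Ids (r1, r2, ...) of the records that satisfy every attribute=value condition."""
    return {
        f"r{i}"
        for i, row in enumerate(ROWS, start=1)
        if all(dict(zip(FIELDS, row))[a] == v for a, v in conditions.items())
    }


def is_characteristic_rule(cls, characteristics):
    """IF cls THEN characteristics holds when the two record sets share a record."""
    return len(records(**cls) & records(**characteristics)) > 0


no_class = {"buys_computer": "no"}
first = {"income": "low", "student": "yes", "credit": "excellent"}
second = {"income": "low", "credit": "fair"}

print(sorted(records(**first)))                   # ['r6', 'r7']
print(is_characteristic_rule(no_class, first))    # True
print(is_characteristic_rule(no_class, second))   # False
```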
Discrimination
• Discrimination is the process whose aim is to find
rules that allow us to discriminate the objects
(records) belonging to a given class from the rest
of the records (classes)
If characteristics then class
• Example: given classification Data 1, 2
• If Age<=30 & income=high & student=no &
credit_rating=fair then buys_computer=no
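A rule of this shape can be tested against the table. The sketch below assumes a working criterion that is not stated in the text above: the rule counts as discriminant when at least one record matches the characteristics and every matching record belongs to the class. The names `records` and `is_discriminant_rule` are hypothetical, and records are assumed to be numbered r1 to r14 in table order.

```python
# The 14 records r1..r14: (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
FIELDS = ("age", "income", "student", "credit_rating", "buys_computer")


def records(**conditions):
    """Ids of the records that satisfy every attribute=value condition."""
    return {
        f"r{i}"
        for i, row in enumerate(ROWS, start=1)
        if all(dict(zip(FIELDS, row))[a] == v for a, v in conditions.items())
    }


def is_discriminant_rule(characteristics, cls):
    """Assumed criterion: the matched set is non-empty and contained in the class."""
    matched = records(**characteristics)
    return bool(matched) and matched <= records(**cls)


no_class = {"buys_computer": "no"}
example = {"age": "<=30", "income": "high", "student": "no", "credit_rating": "fair"}

print(is_discriminant_rule(example, no_class))          # True
print(is_discriminant_rule({"age": "<=30"}, no_class))  # False
```

The second call shows why a single condition is not enough here: Age<=30 also matches r9 and r11, which have buys_computer=yes.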
Discriminant Formula
Discriminant Formula Definition
[Figure: data mining process diagram. Labels shown: Data, SELECTION, Target data, Preprocessing, Processed Data, LEARNING, Rules or Descriptions, knowledge]
Classification Data 1
• Classification data format: a data table with the key attribute removed.
• The special attribute, called the class attribute, is buys_computer.
Age     Income  Student  Credit_rating  Buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
A small, full set of DISCRIMINANT RULES for the classes buys_comp=yes and
buys_comp=no
Classifier
Testing Data
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen Data
(Jeff, Professor, 4)  Tenured?
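To make the testing step concrete, here is a minimal Python sketch that scores a classifier on the testing data and then applies it to the unseen record. The slide does not state the classifier itself, so the rule inside `classify` (Professor, or more than 6 years, implies tenured) is a hypothetical stand-in used only for illustration.

```python
# Testing data from the slide: (name, rank, years, tenured)
TEST_DATA = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def classify(rank, years):
    """Hypothetical rule, assumed for illustration; not given in the source."""
    return "yes" if rank == "Professor" or years > 6 else "no"


def accuracy(labelled_records):
    """Fraction of labelled records whose predicted class equals the true label."""
    correct = sum(
        classify(rank, years) == label
        for _name, rank, years, label in labelled_records
    )
    return correct / len(labelled_records)


print(accuracy(TEST_DATA))       # 0.75 (Merlisa is misclassified)
print(classify("Professor", 4))  # yes (prediction for the unseen record Jeff)
```

Under this assumed rule, three of the four test records are classified correctly, which is the kind of estimate that decides whether the classifier is good enough to apply to unseen data.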
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations.
– New data is classified based on a tested classifier
• Unsupervised learning (clustering)
– The class labels of training data are
unknown
– We are given a set of records
(measurements, observations, etc.)
with the aim of establishing the existence
of classes or clusters in the data