Unit V Classification
Decision trees
Naïve Bayes
Rule-based classification
Classification by back propagation
[Figure: model construction and use. Training Data are fed to Classification Algorithms to build a Classifier; the Classifier is then applied to Testing Data and to Unseen Data.]
Unseen data: (Jeff, Professor, 4) -> Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
July 17, 2015 Data Mining: Concepts and Techniques 8
Supervised vs. Unsupervised Learning
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
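The three preparation steps above can be sketched in plain Python. The records and field names below are hypothetical toy data, not from the slides:

```python
# Toy training records (hypothetical): "age" has a missing value,
# "id" is an irrelevant attribute.
records = [
    {"age": 25,   "income": "high", "id": 1},
    {"age": None, "income": "low",  "id": 2},
    {"age": 47,   "income": "low",  "id": 3},
    {"age": 52,   "income": "high", "id": 4},
]

# 1. Data cleaning: fill the missing age with the mean of the known ages.
known = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in records:
    if r["age"] is None:
        r["age"] = mean_age

# 2. Relevance analysis: drop an attribute that carries no class information.
for r in records:
    del r["id"]

# 3. Data transformation: min-max normalize age into [0, 1].
lo = min(r["age"] for r in records)
hi = max(r["age"] for r in records)
for r in records:
    r["age"] = (r["age"] - lo) / (hi - lo)

print(records)
```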
[Figure: decision tree with root node "age?" branching on <=30, 31..40, and >40, ending in yes/no leaf nodes]
Info(D) = −∑_{i=1}^{m} p_i log2(p_i)
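As a quick check of the formula, a small Python helper (using the buys_computer counts from the training table later in this unit: 9 'yes' tuples and 5 'no' tuples):

```python
import math

def info(class_counts):
    """Expected information (entropy) of a partition D:
    Info(D) = -sum_i p_i * log2(p_i)."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

# 9 tuples of class 'yes' and 5 of class 'no':
print(f"{info([9, 5]):.3f}")  # 0.940
```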
Replace encoding:

cat_df_flights_replace.replace(replace_map_comp, inplace=True)
print(cat_df_flights_replace.head())

Label encoding:

cat_df_flights_lc['carrier'] = cat_df_flights_lc['carrier'].cat.codes
cat_df_flights_lc.head()  # alphabetically labeled from 0 to 10

One-hot encoding with pandas.get_dummies():

cat_df_flights_onehot = cat_df_flights.copy()
cat_df_flights_onehot = pd.get_dummies(cat_df_flights_onehot,
                                       columns=['carrier'],
                                       prefix=['carrier'])
print(cat_df_flights_onehot.head())
SplitInfo_A(D) = −(4/14)·log2(4/14) − (6/14)·log2(6/14) − (4/14)·log2(4/14) = 1.557
GainRatio(A) = Gain(A)/SplitInfo(A)
Ex. GainRatio(income) = 0.029/1.557 = 0.019
The attribute with the maximum gain ratio is selected as
the splitting attribute
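A short sketch of SplitInfo and gain ratio in Python. For the income split of the 14 training tuples into partitions of sizes 4, 6 and 4, the formula evaluates to 1.557; Gain(income) = 0.029 is taken from the slides:

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum_j (|D_j|/|D|) * log2(|D_j|/|D|),
    the potential information generated by splitting D on attribute A."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes)

# Splitting the 14 tuples on income gives partitions of sizes 4, 6 and 4.
si = split_info([4, 6, 4])
gain_income = 0.029  # Gain(income) on the same dataset, from the slides
print(f"SplitInfo_income(D) = {si:.3f}")
print(f"GainRatio(income) = {gain_income / si:.3f}")
```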
The Gini index measures the impurity of D, a data partition or set of training
tuples
For a discrete-valued attribute, the subset that gives the minimum Gini
index for that attribute is selected as its splitting subset.
Reduction in impurity (needs to be maximized):

Δgini(A) = gini(D) − gini_A(D)

Ex. gini(D) = 1 − (9/14)² − (5/14)² = 0.459
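The Gini computation above can be checked with a few lines of Python:

```python
def gini(class_counts):
    """Gini index of a data partition D: gini(D) = 1 - sum_i p_i^2."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

# D has 9 tuples with buys_computer = 'yes' and 5 with 'no'.
print(f"{gini([9, 5]):.3f}")  # 0.459
```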
Naïve Bayesian Classifier: Training Dataset
Class: C1: buys_computer = 'yes', C2: buys_computer = 'no'
Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayesian Classifier: An Example
X = (age <= 30 , income = medium, student = yes,
credit_rating = fair)
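Estimating the class priors and conditional probabilities by frequency counts over the 14 training tuples gives the classification of X, sketched below in plain Python:

```python
# The 14 training tuples: (age, income, student, credit_rating) -> buys_computer
data = [
    (("<=30",   "high",   "no",  "fair"),      "no"),
    (("<=30",   "high",   "no",  "excellent"), "no"),
    (("31..40", "high",   "no",  "fair"),      "yes"),
    ((">40",    "medium", "no",  "fair"),      "yes"),
    ((">40",    "low",    "yes", "fair"),      "yes"),
    ((">40",    "low",    "yes", "excellent"), "no"),
    (("31..40", "low",    "yes", "excellent"), "yes"),
    (("<=30",   "medium", "no",  "fair"),      "no"),
    (("<=30",   "low",    "yes", "fair"),      "yes"),
    ((">40",    "medium", "yes", "fair"),      "yes"),
    (("<=30",   "medium", "yes", "excellent"), "yes"),
    (("31..40", "medium", "no",  "excellent"), "yes"),
    (("31..40", "high",   "yes", "fair"),      "yes"),
    ((">40",    "medium", "no",  "excellent"), "no"),
]

def naive_bayes_score(x, label):
    """P(C_i) * prod_k P(x_k | C_i), estimated by frequency counts."""
    in_class = [feats for feats, y in data if y == label]
    score = len(in_class) / len(data)        # prior P(C_i)
    for k, value in enumerate(x):
        matches = sum(1 for feats in in_class if feats[k] == value)
        score *= matches / len(in_class)     # likelihood P(x_k | C_i)
    return score

X = ("<=30", "medium", "yes", "fair")
for label in ("yes", "no"):
    print(label, round(naive_bayes_score(X, label), 3))
# P(X|yes)P(yes) = 0.028 > P(X|no)P(no) = 0.007, so X is classified as 'yes'
```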
Classification:
predicts categorical class labels
E.g., Personal homepage classification
xi = (x1, x2, x3, …), yi = +1 or −1
x1: # of occurrences of the word "homepage"
x2: # of occurrences of the word "welcome"
Mathematically,
x ∈ X = ℝ^n, y ∈ Y = {+1, −1}
We want a function f: X → Y
Linear Classification
Binary classification problem

[Figure: scatter of 'x' points above a red line and 'o' points below it]

The data above the red line belongs to class 'x'; the data below the red line belongs to class 'o'.
Examples: SVM, Perceptron, Probabilistic Classifiers
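A minimal perceptron sketch on hypothetical toy data (not the figure's points) illustrates how such a separating line can be learned:

```python
def train_perceptron(points, labels, epochs=20, lr=0.1):
    """Perceptron learning rule: nudge the weights whenever a
    training point falls on the wrong side of the current line."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

# Linearly separable toy data: class +1 ('x') above the line x2 = x1,
# class -1 ('o') below it.
pts = [(0, 1), (1, 2), (2, 3), (1, 0), (2, 1), (3, 2)]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(pts, ys)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for x1, x2 in pts]
print(preds == ys)  # True: the learned line separates the two classes
```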