Classification Models
Attribute Selection Measures
There are two popular attribute selection measures:
1. Information gain
2. Gini index
Decision Tree : Using “Information Gain”
ID3 uses information gain as its attribute selection measure.
Recall from earlier slides:
Entropy(D) = -p1 log2(p1) - p2 log2(p2)
[Worked in Excel: entropy computed for two impure sets and one pure set]
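Instead of Excel, the entropy of a set can be checked with a few lines of R (a minimal sketch; the example vectors are illustrative, not from the slides):

# Entropy(D) = -p1*log2(p1) - p2*log2(p2), generalized to any number of classes
entropy = function(labels) {
  p = table(labels) / length(labels)   # class proportions
  p = p[p > 0]                         # skip empty classes to avoid log2(0)
  -sum(p * log2(p))
}
entropy(c('yes', 'no', 'yes', 'no'))     # impure set -> 1
entropy(c('yes', 'yes', 'yes', 'yes'))   # pure set   -> 0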
Attribute Selection Measures : Gini Index
Example, using the "youth" partition of the training data (2 yes, 3 no):

age    income  student  credit_rating  Class: buys_computer
youth  high    no       fair           no
youth  high    no       excellent      no
youth  medium  no       fair           no
youth  medium  yes      excellent      yes
youth  low     yes      fair           yes
Gini Index
Gini = 1 - p1^2 - p2^2, computed for each value of age:

age          yes  no  Total  p1   p2   Gini
middle-aged    4   0      4  1.0  0.0  0.00
senior         3   2      5  0.6  0.4  0.48
youth          2   3      5  0.4  0.6  0.48
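The same numbers can be reproduced in R rather than Excel; a minimal sketch, with the yes/no counts taken from the table above:

# Gini of one partition, given its yes/no class counts
gini = function(yes, no) {
  p1 = yes / (yes + no)
  p2 = no / (yes + no)
  1 - p1^2 - p2^2
}
gini(4, 0)   # middle-aged -> 0.00
gini(3, 2)   # senior      -> 0.48
gini(2, 3)   # youth       -> 0.48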
Gini Index for the "Root node"
• Calculate the Gini index for ALL attributes (a sketch of the computation follows below):
Gini :
Income = 0.27
Student = 0.23
Credit-rating = 0.30
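A generic helper for the weighted Gini of a split; a minimal sketch, assuming the 14-row training table sits in a data frame d (a hypothetical name) with the columns shown earlier:

# Weighted Gini index of splitting on a single attribute
gini_split = function(attr, class) {
  n = length(class)
  sum(sapply(split(class, attr), function(cl) {
    p = table(cl) / length(cl)           # class proportions in this partition
    (length(cl) / n) * (1 - sum(p^2))    # partition Gini, weighted by its size
  }))
}

# lower is better; compare across the candidate attributes
sapply(d[c("age", "income", "student", "credit_rating")],
       gini_split, class = d$Class)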
Age at the "Root node" + further split

[Tree so far: age at the root; middle-aged -> leaf "yes"; senior/youth -> split on student (no / yes)]

The remaining senior/youth rows:

age     income  student  credit_rating  Class: buys_computer
senior  medium  no       excellent      no
senior  medium  no       fair           yes
youth   high    no       excellent      no
youth   high    no       fair           no
youth   medium  no       fair           no
senior  low     yes      excellent      no
senior  low     yes      fair           yes
senior  medium  yes      fair           yes
youth   medium  yes      excellent      yes
youth   low     yes      fair           yes

Gini within the two student branches:

student = no              student = yes
Income = 0.09             Income = 0.09
age = 0.07                age = 0.09
Credit-rating = 0.09      Credit-rating = 0.07
Age at the "Root node" + further split (continued)

[Tree grown one level further: in the student = no branch, split on age: youth -> leaf "no"; senior -> split further on credit_rating (excellent -> no, fair -> yes)]

Gini within the branch to be split further:
Income = 0.09
age = 0.09
Credit-rating = 0.07
Credit-rating has the lowest Gini, so it is chosen for the next split.
[Final decision tree: age at the root; middle-aged -> yes; senior/youth -> student; the student branches are split further, with credit_rating = excellent -> no and credit_rating = fair -> yes at the leaves]
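rpart grows this kind of tree automatically (the Gini index is its default split criterion for classification); a minimal sketch, again assuming the 14-row table is in the hypothetical data frame d:

library(rpart)
# minsplit/cp are loosened so the tiny 14-row table can be split fully
toyTree = rpart(Class ~ age + income + student + credit_rating,
                data = d, method = 'class',
                control = rpart.control(minsplit = 2, cp = 0))
print(toyTree)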
Decision Tree : Case - Telecom Customer Churn

Customer Attrition
Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers.
Companies in sectors such as telecom often have customer-service branches that attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited ones.
Dataset : Telco Customer Churn

df <- read.csv(file.choose(), header = T, stringsAsFactors = T)  # factors are needed for the models below
str(df)
# data cleaning (guided by str)
df$customerID = NULL                            # drop the ID column
df$SeniorCitizen = as.factor(df$SeniorCitizen)  # stored as a 0/1 integer; make it a factor
str(df$SeniorCitizen)
str(df)
library(plyr)
summary(df)
df = na.omit(df)   # TotalCharges has a few missing values; drop those rows
dev.new()
boxplot(df$tenure ~ df$Churn)   # tenure distribution by churn status

# Model - Starts
library(rpart)
set.seed(123)
rno = sample(nrow(df), nrow(df) * 0.7)   # 70% of row indices for training
trn = df[rno, ]
tst = df[-rno, ]
dtree1 = rpart(Churn ~ ., data = trn,
               method = 'class')
# Tree plot
library(rattle)
dev.new()
fancyRpartPlot(dtree1, type = 3)
# Predict & Confusion Matrix
tst$predProb = predict(dtree1, newdata = tst)[, 'Yes']   # P(Churn = 'Yes')
tst$pred = ifelse(tst$predProb > 0.5, 'Yes', 'No')
tst$pred = factor(tst$pred, levels = levels(tst$Churn))
library(caret)
confusionMatrix(tst$pred, tst$Churn)
Random Forest - Ensemble Method

Ensemble learning improves machine-learning results by combining several models, which yields better predictive performance than any single model.
Bagging (Bootstrap sampling) in RF

Randomly draw datasets with replacement from the original training data, each sample the same size as the training set.

[Diagram: Original data set (1000 records) -> Training set (70%, 700 records) + Testing set (30%, 300 records); from the training set, Bootstrap Samples 1, 2, 3, ... (each 700 records, drawn with replacement) each train their own decision tree]
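The slides show the confusion-matrix call but not the model fit itself. A minimal sketch of the missing step, using the randomForest package (the object names rf1/rfPred/rfPredProb are chosen to match the calls that follow; this is an assumption, not the original code):

library(randomForest)
set.seed(123)
# each tree is grown on a bootstrap sample of trn, as in the diagram above
rf1 = randomForest(Churn ~ ., data = trn)
rfPred = predict(rf1, newdata = tst)                                  # predicted class labels
tst$rfPredProb = predict(rf1, newdata = tst, type = 'prob')[, 'Yes']  # P(Churn = 'Yes')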
confusionMatrix(rfPred,tst$Churn)
Model Evaluation :
Decision Tree vs. Random Forest

Accuracy of RF > accuracy of DT, so RF is the better model.

Decision Tree:
           Ref.pos  Ref.neg
Pred.pos       684      202
Pred.neg        16       98
Total          700      300
TPR = 0.98   FPR = 0.67   Accuracy = 0.78

Random Forest:
           Ref.pos  Ref.neg
Pred.pos       647      120
Pred.neg        53      180
Total          700      300
TPR = 0.92   FPR = 0.40   Accuracy = 0.83

In general:
           Ref.pos  Ref.neg
Pred.pos   TP       FP
Pred.neg   FN       TN
TPR (True Positive Rate) = TP/(TP+FN)    FPR = FP/(FP+TN)
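Plugging the counts into these formulas confirms the slide's rates (a quick check; the DT/RF labels follow the accuracy comparison above):

rates = function(TP, FP, FN, TN) c(
  TPR = TP / (TP + FN),
  FPR = FP / (FP + TN),
  Accuracy = (TP + TN) / (TP + FP + FN + TN))
rates(684, 202, 16, 98)    # Decision Tree: 0.98, 0.67, 0.78
rates(647, 120, 53, 180)   # Random Forest: 0.92, 0.40, 0.83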
library(pROC)
dev.new()
plot.roc(tst$Churn, tst$rfPredProb,
         print.auc = T, main = "Random Forest")