
For Classification Models

Attribute Selection Measures
There are two popular attribute selection measures:
1. Information gain
2. Gini index
Decision Tree : Using “Information Gain”
ID3 uses information gain as its attribute selection measure.
Remember... earlier slides:

This measure is based on pioneering work by Claude Shannon on information theory, which studied the value or "information content" of messages. J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser).

The three decision tree algorithms ID3, C5.0, and CART adopt a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner.
Case : Build an ML predictive model: "Who will buy a computer?"

    age          income  student  credit_rating  Class: buy_computer
1   youth        high    no       fair           no
2   youth        high    no       excellent      no
3   middle_aged  high    no       fair           yes
4   senior       medium  no       fair           yes
5   senior       low     yes      fair           yes
6   senior       low     yes      excellent      no
7   middle_aged  low     yes      excellent      yes
8   youth        medium  no       fair           no
9   youth        low     yes      fair           yes
10  senior       medium  yes      fair           yes
11  youth        medium  yes      excellent      yes
12  middle_aged  medium  no       excellent      yes
13  middle_aged  high    yes      fair           yes
14  senior       medium  no       excellent      no
Algorithm : "Information Gain" (using Entropy)

Step 1: Calculate Info(D) or Entropy(D), where D is the set of class labels of the dataset ("yes", "no").

Calculate the probability of each class (from the table above): yes = 9, no = 5, Total = 14.

Probability formula: p = favourable cases / total

Let p1 = P(yes) = 9/14 = 0.64
Let p2 = P(no) = 5/14 = 0.36
Step 1 (continued):

Let p1 = P(yes) = 9/14 = 0.64
Let p2 = P(no) = 5/14 = 0.36

Entropy(D) = -p1 * log2(p1) - p2 * log2(p2)

Use Excel → Calculate

Entropy(D) = 0.94
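The same Step-1 arithmetic can be checked in R instead of Excel; a minimal sketch, using the class counts from the table above:

# Step 1 in R: entropy of the class column (9 "yes", 5 "no")
p1 <- 9/14                          # P(buy_computer = "yes")
p2 <- 5/14                          # P(buy_computer = "no")
entropy_D <- -p1*log2(p1) - p2*log2(p2)
entropy_D                           # ~0.94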


Step 2: Calculate Info(Age) or Entropy(Age).

Partition the data by the three values of age and compute the entropy of each subset:

a) middle_aged: Total = 4, yes = 4, no = 0; p1 = 4/4 = 1.0, p2 = 0/4 = 0; Entropy = 0
b) senior:      Total = 5, yes = 3, no = 2; p1 = 3/5 = 0.6, p2 = 2/5 = 0.4; Entropy = 0.97
c) youth:       Total = 5, yes = 2, no = 3; p1 = 2/5 = 0.4, p2 = 3/5 = 0.6; Entropy = 0.97

Go to Excel → Calculate
Step 3: Since the Age attribute has THREE subsets, Info(Age) or Entropy(Age) is the weighted average of the three:

a) middle_aged: Total = 4, Entropy = 0
b) senior:      Total = 5, Entropy = 0.97
c) youth:       Total = 5, Entropy = 0.97
Grand total = 14

Entropy(Age) = (4/14)*0 + (5/14)*0.97 + (5/14)*0.97 = 0.69
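Steps 2 and 3 can also be checked in R; a small sketch of the per-subset entropies and their weighted average, using the counts from the partition above:

# Entropy of each age subset, then the weighted average
entropy <- function(yes, no) {
  p <- c(yes, no) / (yes + no)
  p <- p[p > 0]                     # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}
n <- c(middle_aged = 4, senior = 5, youth = 5)
e <- c(entropy(4, 0), entropy(3, 2), entropy(2, 3))
entropy_age <- sum(n / sum(n) * e)  # ~0.69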


Decision Tree : Using "Information Gain"

Gain(age) = Entropy(D) - Entropy(age) = 0.94 - 0.69 = 0.25

Similarly:
Gain(income) = 0.03
Gain(student) = 0.15
Gain(credit_rating) = 0.05

Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
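The same gains can be reproduced directly in R; a hedged sketch that assumes the 14-row table has been read into a data frame d with the column names shown above (buy_computer as the class column):

# Information gain of one attribute with respect to the class column
info_gain <- function(d, attr, target = "buy_computer") {
  ent <- function(y) {                        # entropy of a class vector
    p <- as.numeric(table(y)) / length(y)
    p <- p[p > 0]
    -sum(p * log2(p))
  }
  subsets <- split(d[[target]], d[[attr]])    # partition the classes by the attribute
  ent(d[[target]]) -
    sum(sapply(subsets, function(s) length(s) / nrow(d) * ent(s)))
}
sapply(c("age", "income", "student", "credit_rating"), info_gain, d = d)
# age should come out highest (~0.25), matching the hand calculation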
Attribute Selection Measures : Gini Index

The Gini index is the name of the cost function used to evaluate splits in the dataset. A Gini score gives an idea of how good a split is by how mixed the classes are in the two groups created by the split.

• The Gini index measures the impurity of D, a data partition.
[Figure: two impure sets and one pure set]

A perfect separation results in a Gini score of 0, whereas the worst-case split, which produces 50/50 classes in each group, results in a Gini score of 0.5 (for a 2-class problem).

Gini Index
Let p1 = probability of "yes" and p2 = probability of "no"; for a 2-class problem, Gini = 1 - p1^2 - p2^2.

Example (Income):
Income   yes  no  Total  p1   p2   Gini
High     4    0   4      1.0  0.0  0.00
Low      3    3   6      0.5  0.5  0.50
Medium   4    4   8      0.5  0.5  0.50
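A minimal R sketch of the same two-class Gini computation, with the counts from the table above:

# Gini impurity of a two-class group
gini <- function(yes, no) {
  p1 <- yes / (yes + no)
  p2 <- no  / (yes + no)
  1 - p1^2 - p2^2
}
gini(4, 0)   # 0.00  (pure group)
gini(3, 3)   # 0.50  (worst case for 2 classes)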
Case : Marketing Decision Tree using Gini Index

Data file: Decision-Tree-Buy-Computer.xlsx (the same 14-row "buy_computer" table shown earlier).

Objective: Find a decision tree to predict whether a customer will buy a computer or not (for given customer info.).
Gini Index for all variables

Let p1 = probability of "yes" and p2 = probability of "no" within each attribute value, then compute Gini = 1 - p1^2 - p2^2 for every value (go to Excel → calculate).

For age:
age          yes  no  Total  p1   p2   Gini
middle_aged  4    0   4      1.0  0.0  0.00
senior       3    2   5      0.6  0.4  0.48
youth        2    3   5      0.4  0.6  0.48
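The per-value scores are then combined into one weighted Gini for the attribute; a quick check in R, with the weights taken from the table above:

# Weighted Gini for the age attribute
n <- c(middle_aged = 4, senior = 5, youth = 5)
g <- c(0, 0.48, 0.48)
sum(n / sum(n) * g)   # ~0.34, the value used for root-node selection on the next slide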
Gini Index for "Root node"

Calculate the Gini index for ALL attributes:
• Age = 0.34
• Income = 0.44
• Student = 0.37
• Credit_rating = 0.43

"Age" has the lowest cost (Gini), so split on "Age" at the root node.
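As with information gain, these four numbers can be reproduced in R; a hedged sketch, again assuming the 14-row table is in a data frame d:

# Weighted Gini index of one attribute (lower is better)
gini_index <- function(d, attr, target = "buy_computer") {
  g <- function(y) {                          # Gini impurity of a class vector
    p <- as.numeric(table(y)) / length(y)
    1 - sum(p^2)
  }
  subsets <- split(d[[target]], d[[attr]])
  sum(sapply(subsets, function(s) length(s) / nrow(d) * g(s)))
}
sapply(c("age", "income", "student", "credit_rating"), gini_index, d = d)
# age should come out lowest (~0.34), so it is chosen for the root split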
Age at "Root node"

Split on age:
• middle_aged → all 4 records are "yes", so this branch becomes a leaf (yes).
• senior / youth → the classes are mixed, so this branch is split further.

Gini for the candidate attributes within the senior/youth branch:
• Income = 0.27
• Student = 0.23
• Credit_rating = 0.30
Age at "Root node" + further split

Student has the lowest Gini (0.23) in the senior/youth branch, so it is used for the next split:
• age = middle_aged → yes
• age = senior/youth → split on student
  - student = no → split further; Gini: Income = 0.09, age = 0.07, Credit_rating = 0.09
  - student = yes → split further; Gini: Income = 0.09, age = 0.09, Credit_rating = 0.07
Age at "Root node" + further split (continued)

In the student = no branch, age has the lowest Gini (0.07), so split on age:
• youth → no
• senior → split on credit_rating: excellent → no, fair → yes

In the student = yes branch the Gini values are Income = 0.09, age = 0.09, Credit_rating = 0.07, so credit_rating is chosen for the next split.
Final decision tree:
• age = middle_aged → yes
• age = senior/youth → split on student
  - student = no → split on age
    • youth → no
    • senior → split on credit_rating: excellent → no, fair → yes
  - student = yes → split on credit_rating
    • fair → yes
    • excellent → split on age: senior → no, youth → yes
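The same tree can be grown programmatically; a hedged sketch with rpart, again assuming the 14-row table is in a data frame d (the control settings are relaxed only because this toy data set is so small):

# Grow a Gini-based classification tree on the buy_computer table
library(rpart)
d[] <- lapply(d, factor)                     # rpart expects factors for a classification tree
fit <- rpart(buy_computer ~ age + income + student + credit_rating,
             data = d, method = "class",
             parms = list(split = "gini"),
             control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))
print(fit)                                   # compare the splits with the hand-built tree above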
Decision Tree : Case -Telecom Customer churn
Customer Attrition
Customer attrition, also known as customer churn, customer turnover, or
customer defection, is the loss of clients or customers.

Telephone service companies, Internet service providers, pay TV companies, insurance firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as one of their key business metrics, because the cost of retaining an existing customer is far less than that of acquiring a new one.

Companies from these sectors often have customer service branches which
attempt to win back defecting clients, because recovered long-term customers
can be worth much more to a company than newly recruited clients.
Decision Tree : Case -Telecom Customer churn
Dataset : Telco Customer Churn

• Each row represents a customer; each column contains a customer attribute, described in the column metadata.

• The raw data contains 7043 rows (customers) and 20 columns (features).

• The "Churn" column is our target.
Decision Tree : Case -Telecom Customer churn

df <- read.csv(file.choose(), header = T,
               stringsAsFactors = TRUE)      # keep character columns as factors (needed for revalue/summary below; R 4.x defaults to FALSE)
str(df)
# data cleaning (based on str output)
df$customerID <- NULL                        # was churn$customerID; the data frame here is named df
df$SeniorCitizen <- as.integer(df$SeniorCitizen)
df$SeniorCitizen <- as.factor(df$SeniorCitizen)
str(df$SeniorCitizen)
Decision Tree : Case -Telecom Customer churn
str(df)                                      # was str(churn); the data frame is named df
# data cleaning: recode "No internet service" to "No"
library(plyr)

df$OnlineBackup <- revalue(df$OnlineBackup,
                           c("No internet service" = "No"))
summary(df$OnlineBackup)
Decision Tree : Case -Telecom Customer churn
df$OnlineMovies <- revalue(df$OnlineMovies, c("No internet service" = "No"))
df$OnlineTV <- revalue(df$OnlineTV, c("No internet service" = "No"))
df$TechnicalHelp <- revalue(df$TechnicalHelp, c("No internet service" = "No"))
df$DeviceProtectionService <- revalue(df$DeviceProtectionService,
                                      c("No internet service" = "No"))
df$OnlineSecurity <- revalue(df$OnlineSecurity, c("No internet service" = "No"))

df$MultipleConnections <- revalue(df$MultipleConnections,
                                  c("No phone service" = "No"))

summary(df)                                  # was Summary(df); R function names are case-sensitive
Decision Tree : Case -Telecom Customer churn

# check for NAs
sapply(df, function(x) sum(is.na(x)))        # was sapply(churn, ...); the data frame is named df


# EDA
library(gmodels)
CrossTable(df$Churn,df$gender,
prop.chisq = FALSE,
prop.c = F,
prop.t = F,
chisq = T)

dev.new()
boxplot(df$tenure~df$Churn)
# Model - starts

set.seed(123)
rno <- sample(nrow(df), nrow(df) * 0.7)      # 70% of rows for training

trn <- df[rno, ]
tst <- df[-rno, ]

library(rpart)
library(rpart.plot)

dtree1 <- rpart(Churn ~ ., data = trn,
                method = 'class')

# Tree plot
library(rattle)

dev.new()
fancyRpartPlot(dtree1, type = 3)

# Predict & Confusion Matrix
tst$predProb <- predict(dtree1, newdata = tst)[, 'Yes']   # probability of class "Yes"
str(trn$Churn)

tst$pred <- ifelse(tst$predProb > 0.5, 'Yes', 'No')
str(tst$pred)
tst$pred <- factor(tst$pred, levels = levels(tst$Churn))  # align factor levels with the reference

library(caret)
confusionMatrix(tst$pred, tst$Churn)
Random Forest - Ensemble Method

Ensemble learning helps improve machine learning results by combining several models; this approach yields better predictive performance than any single model.
Bagging (Bootstrap Sampling) in RF

Randomly draw datasets with replacement from the original data, each sample the same size as the training set.

Original data set: 1000 records
• Training data set (70%): 700 records
• Testing data set (30%): 300 records

Each bootstrap training sample (70% of the original data, 700 records) is used to grow one decision tree:
• Bootstrap Training Sample-1 → Decision Tree
• Bootstrap Training Sample-2 → Decision Tree
• Bootstrap Training Sample-3 → Decision Tree
• Bootstrap Training Sample-4 → Decision Tree
... and so on.
Many trees grown randomly → Random Forest

Each bootstrap random sample (70% of the original data, 700 records) grows its own decision tree. The forest is a committee of trees: each tree casts a vote for the predicted class, and the majority vote becomes the prediction.
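To make the "committee of trees" idea concrete, a toy sketch of bagging by hand in R (randomForest does all of this internally; trn and tst are the training/testing splits from the earlier slides):

# Hand-rolled bagging: bootstrap samples with replacement, one tree each, then a majority vote
set.seed(1)
votes <- replicate(5, {
  boot <- trn[sample(nrow(trn), replace = TRUE), ]            # bootstrap sample, same size as trn
  tree <- rpart::rpart(Churn ~ ., data = boot, method = "class")
  as.character(predict(tree, newdata = tst, type = "class"))
})
tst$bagPred <- apply(votes, 1, function(v) names(which.max(table(v))))   # majority vote per row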
# Random Forest
library(randomForest)
rf <- randomForest(Churn ~ ., data = trn)
print(rf)

tst$rfPredProb <- predict(rf, newdata = tst, type = 'prob')[, 'Yes']   # probability of "Yes"

rfPred <- predict(rf, newdata = tst)

confusionMatrix(rfPred, tst$Churn)
Model Evaluation :
Decision Tree vs. Random Forest

• Accuracy of RF > accuracy of DT → RF is the better model.

• But the threshold used was 0.50 for 'Yes' vs. 'No'.

• To check accuracy at various thresholds (0 to 1), plot the ROC curve and calculate the AUC (Area Under the Curve).
ROC Curve : Various Thresholds

Each threshold produces its own confusion matrix and hence its own TPR and FPR. In general:

          Ref.pos            Ref.neg
Pred.pos  TP                 FP
Pred.neg  FN                 TN
          TPR = TP/(TP+FN)   FPR = FP/(FP+TN)

Three example thresholds:

          Ref.pos   Ref.neg
Pred.pos  588       94
Pred.neg  112       206
Total     700       300
          TPR=0.84  FPR=0.31

          Ref.pos   Ref.neg
Pred.pos  647       120
Pred.neg  53        180
Total     700       300
          TPR=0.92  FPR=0.40

          Ref.pos   Ref.neg
Pred.pos  684       202
Pred.neg  16        98
Total     700       300
          TPR=0.98  FPR=0.67

The ROC curve plots TPR (True Positive Rate) on the y-axis against FPR (False Positive Rate) on the x-axis across all thresholds.
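A short sketch of how one (TPR, FPR) point is produced, using the random forest probabilities computed earlier and an arbitrary threshold of 0.3:

# Confusion counts and rates at a chosen threshold
thr  <- 0.3
pred <- ifelse(tst$rfPredProb > thr, 'Yes', 'No')
TP <- sum(pred == 'Yes' & tst$Churn == 'Yes'); FP <- sum(pred == 'Yes' & tst$Churn == 'No')
FN <- sum(pred == 'No'  & tst$Churn == 'Yes'); TN <- sum(pred == 'No'  & tst$Churn == 'No')
c(TPR = TP / (TP + FN), FPR = FP / (FP + TN))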


# ROC curve & AUC
library(pROC)                                # plot.roc() is provided by the pROC package

dev.new()
plot.roc(tst$Churn, tst$predProb,
         print.auc = T, main = "Decision Tree")

dev.new()
plot.roc(tst$Churn, tst$rfPredProb,
         print.auc = T, main = "Random Forest")
Model Evaluation :
Decision Tree vs. Random Forest

1. Accuracy of RF > DT → RF is the better model.
2. AUC of RF > DT → RF is the better model.