DM Group Assignment

The document summarizes the steps taken to build a logistic regression model to predict customer churn using a telecom dataset. Key steps included: 1. Exploratory data analysis including visualizations of categorical and continuous variables to identify predictors and response variable. 2. Splitting the data into 70% training and 30% test sets. 3. Building a logistic regression model on the training set with significant predictors identified from chi-square tests and scaling. 4. Evaluating the model's performance on both training and test sets using metrics like AUC, confusion matrix, precision, recall and F1 score.


Predictive Modelling

Group Assignment
Churn Model – Logistic Regression

Submitted by: Prajakta Vast, Surbhi Rane, Kritima Upadhyay, Supriya Kadam, Pratyush Chodhary
Table of Contents:-
1. Selection of the response variable and the predictor variables
2. Exploratory Data Analysis
3. Splitting the data into training (70%) and test (30%) sets
4. Variable selection using various methods
5. Analysing the accuracy of the model

1. Selection of Response and Predictor Variables.


The objective of this assignment is to apply the learnings of Predictive
Modelling to build a logistic regression model that predicts which
customers will cancel their service.

Logistic regression is a predictive modelling algorithm that is used when
the Y variable is binary categorical, i.e. it can take only two values, such
as 1 or 0. The goal is to determine a mathematical equation that can be used
to predict the probability of event 1. Once the equation is established, it
can be used to predict Y when only the X's are known.
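This mapping from a linear score to a probability can be sketched in a few lines of R; the coefficients b0 and b1 below are illustrative placeholders, not values fitted on the assignment dataset:

```r
# Logistic (sigmoid) function: maps any real-valued score to a probability in (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Illustrative coefficients (hypothetical, not fitted on the churn data)
b0 <- -2.0   # intercept
b1 <-  0.8   # slope for a single predictor x

x <- 3
p <- logistic(b0 + b1 * x)   # estimated P(Y = 1 | x)
p
```

In the actual model below, glm() with family = binomial estimates such coefficients by maximum likelihood.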

Here, the dataset was given without indicating the independent and
dependent variables, so considering the output as the Churn Rate
(Categorical Variable) i.e. Y.

Data Preparation:-
Note: All the working has been done using R Studio & Tableau.

## Let us first set the working directory path

> setwd("<path to working directory>")
> getwd()

## Import the data file

Data<-read.csv("GA_Dataset.csv")

2. Exploratory Data Analysis


Let us first study the structure of the dataset provided.

dim(Data)
str(Data)

head(Data)

A few points to note:-

 The dataset consists of 3333 observations & 11 variables
 All of our variables are in integer or numeric format
 To decide which variable should be treated as Y and which as the
independent variables, we ran the head command to inspect the first few
rows of the data
 Churn is our dependent variable & the remaining 10 are our
independent variables
 We will convert Churn, ContractRenewal & DataPlan to factors for EDA
purposes only

Data$Churn<-as.factor(Data$Churn)

Churn is a categorical variable with 1 & 0

Data$ContractRenewal<-as.factor(Data$ContractRenewal)

ContractRenewal is a categorical variable with 1 & 0


Data$DataPlan<-as.factor(Data$DataPlan)

DataPlan is a categorical variable with 1 & 0

library(psych)   # describe() assumed to come from the psych package
describe(Data)

Since n is 3333 for all 11 variables, there are no missing values.
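The describe() check can be confirmed directly with colSums(is.na(...)); the toy data frame below is hypothetical and only illustrates the idiom:

```r
# Toy data frame with one deliberate NA (hypothetical, not the churn data)
toy <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))

# Per-column count of missing values
colSums(is.na(toy))   # a = 1, b = 0

# Applied to the assignment data, colSums(is.na(Data)) returns all zeros
```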

summary(Data)

Histogram for all the Numerical Variables.

library(tidyverse)   # %>%, keep(), gather(), ggplot()
Data %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

A few insights from the above histogram:-

 The median of DataUsage is 0 GB, which means 50% of the customers do
not use the internet at all
 The majority of customers use less than 0.5 GB of monthly data

Customer Churn

table(Data$Churn)

0 1
2850 483

round(prop.table(table(Data$Churn)),3)

0 1
0.855 0.145

g1 <- ggplot(Data, aes(Churn, ..count..)) +
  geom_bar(aes(fill = Churn), position = "dodge") +
  ggtitle("Customer Churn") +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -0.5, hjust = 1) +
  geom_text(stat = 'count',
            aes(label = paste0("(", round(..count../sum(..count..)*100), "%)")),
            vjust = -0.5, hjust = -0.3) +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual("Customer Churn", values = c("darkorange", "darkorchid"))

g1

 We can see that 483 of 3333 customers cancelled the service, i.e. about
14.5% of customers churned. The dataset is therefore unbalanced: the 0s
and 1s in the dependent variable are highly unequal.
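One practical consequence of this imbalance, using the class counts from the table above: a trivial model that always predicts "no churn" already reaches 85.5% accuracy, which is why precision, recall and F1 are examined later rather than accuracy alone.

```r
# Class counts from table(Data$Churn)
n_stay  <- 2850
n_churn <- 483

# Accuracy of always predicting the majority class (0 = no churn)
baseline_accuracy <- n_stay / (n_stay + n_churn)
round(baseline_accuracy, 3)   # 0.855
```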

Customer & Contract Renewal

table(Data$ContractRenewal)
0 1

323 3010

round(prop.table(table(Data$ContractRenewal)),3)

0 1
0.097 0.903

g2 <- ggplot(Data, aes(ContractRenewal, ..count..)) +
  geom_bar(aes(fill = ContractRenewal), position = "dodge") +
  ggtitle("Customer & Contract Renewal") +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -0.5, hjust = 1) +
  geom_text(stat = 'count',
            aes(label = paste0("(", round(..count../sum(..count..)*100), "%)")),
            vjust = -0.5, hjust = -0.3) +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual("Customer & Contract Renewal",
                    values = c("darkorange", "darkorchid"))

g2
 9.7% of the customers did not Renew the Contract

Customer & Data Plan

table(Data$DataPlan)

0 1
2411 922

round(prop.table(table(Data$DataPlan)),3)

0 1
0.723 0.277

g3 <- ggplot(Data, aes(DataPlan, ..count..)) +
  geom_bar(aes(fill = DataPlan), position = "dodge") +
  ggtitle("Customer & Data Plan") +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -0.5, hjust = 1) +
  geom_text(stat = 'count',
            aes(label = paste0("(", round(..count../sum(..count..)*100), "%)")),
            vjust = -0.5, hjust = -0.3) +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual("Customer & Data Plan",
                    values = c("darkorange", "darkorchid"))

g3

 Only 27.7% of the customers have a data plan; the remaining 72.3% do not.
Data Usage

summary(Data$DataUsage)

Min. 1st Qu. Median Mean 3rd Qu. Max.


0.0000 0.0000 0.0000 0.8165 1.7800 5.4000

dataUsage <- cut(Data$DataUsage, include.lowest = TRUE,
                 breaks = seq(0, 5.5, by = 0.5))

g5 <- ggplot(Data, aes(dataUsage, ..count.., fill = Churn)) +
  geom_bar(position = "dodge")
g5

 Churning is maximum in the 0-0.5 GB data-usage category.

boxplot(Data$DataUsage~Data$Churn,horizontal = T,main="Customer
Churn vs Data Usage",xlab="Data Usage",ylab="Customer Churn")
 The distribution for churned customers is concentrated at zero, which
means that customers who churned either didn't use data or didn't have a
data plan.

CustServCalls

summary(Data$CustServCalls)

Min. 1st Qu. Median Mean 3rd Qu. Max.


0.000 1.000 1.000 1.563 2.000 9.000

g6 <- ggplot(Data, aes(CustServCalls, ..count.., fill = Churn)) +
  geom_bar(position = "dodge")

g6
 Churn Rate is higher if a customer makes more than 4 calls to customer
service.
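The pattern in the bar chart can be quantified as a churn rate per call count using row-wise proportions; the mini-sample below is hypothetical, with the equivalent call on the full dataset shown as a comment:

```r
# Hypothetical mini-sample: service calls and churn flag per customer
calls <- c(0, 1, 1, 2, 4, 5, 5, 5)
churn <- c(0, 0, 0, 0, 1, 1, 1, 0)

# margin = 1 normalises each row, giving the churn rate at each call count
churn_by_calls <- prop.table(table(calls, churn), margin = 1)
round(churn_by_calls, 2)

# On the assignment data:
# prop.table(table(Data$CustServCalls, Data$Churn), margin = 1)
```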

BoxPlot with respective Churn

boxplot(Data$CustServCalls~Data$Churn,horizontal = T,main="Customer
Churn vs Customer Service Calls",xlab="Customer Service
Calls",ylab="Customer Churn")

 Customers who churned had a median of 2 customer service calls
 The boxplot is right skewed, which means customers who churned made
more customer service calls

MonthlyCharge
monthlycharge <- cut(Data$MonthlyCharge, include.lowest = TRUE,
                     breaks = seq(14, 114, by = 10))

g9 <- ggplot(Data, aes(monthlycharge, ..count.., fill = Churn)) +
  geom_bar(position = "dodge")

g9

 Churn rate is highest when the monthly bill is between 64 and 74.

Overage Fee

overagefee <- cut(Data$OverageFee, include.lowest = TRUE,
                  breaks = seq(0, 19, by = 1.9))

g10 <- ggplot(Data, aes(overagefee, ..count.., fill = Churn)) +
  geom_bar(position = "dodge")

g10

 No clear pattern is visible from Overage Fee


Checking for Correlation between variables
Data_Num<-read.csv("c:/Kritima PGP-BABI/GA_Dataset.csv")

Cor_Matrix<-round(cor(Data_Num[,-1]),2)

Cor_Matrix

corrplot(cor(Data_Num),type="lower",order="hclust",method =
"color",bg="black",title = "Correlation Plot")
 DataPlan & DataUsage are Highly correlated (0.95)
 DataPlan & Monthly charge highly correlated (0.74)
 DataUsage & Monthly charge highly correlated (0.78)
 DayMins & Monthly charge moderately correlated (0.57)
 Churn does not seem to be highly correlated with any of the variables

In order to determine significant categorical variables, we perform a
chi-square test:

attach(Data)

table(Churn,ContractRenewal)

ContractRenewal
Churn 0 1
0 186 2664
1 137 346

chisq.test(table(Churn,ContractRenewal))

Pearson's Chi-squared test with Yates' continuity correction

data: table(Churn, ContractRenewal)
X-squared = 222.57, df = 1, p-value < 2.2e-16

 p-value less than 0.05, hence ContractRenewal is a significant predictor

table(Churn,DataPlan)

DataPlan
Churn 0 1
0 2008 842
1 403 80

chisq.test(table(Churn,DataPlan))

 p-value less than 0.05, hence DataPlan is a significant predictor
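Rather than testing each categorical predictor one at a time, the p-values can be collected in a single sapply() pass; the factors below are simulated stand-ins, not the dataset's columns:

```r
# Simulated factors standing in for the categorical predictors (hypothetical)
set.seed(1)
y  <- factor(rbinom(300, 1, 0.3))   # response stand-in
x1 <- factor(rbinom(300, 1, 0.5))   # predictor stand-in 1
x2 <- factor(rbinom(300, 1, 0.3))   # predictor stand-in 2

# Chi-square p-value of each predictor's association with y
pvals <- sapply(list(ContractRenewal = x1, DataPlan = x2),
                function(x) chisq.test(table(y, x))$p.value)
pvals   # predictors with p < 0.05 would be kept
```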

Scaling Data Set


Data_Num.scale <- as.data.frame(scale(Data_Num[,-1]))

library(tibble)   # for add_column()
New_Data <- add_column(Data_Num.scale, Churn, .before = "AccountWeeks")

logit = glm(Churn ~ ., data = New_Data, family = binomial)

summary(logit)
 From the above logistic regression, ContractRenewal, DataPlan,
CustServCalls & RoamMins are significant predictors of customer
churn. Hence we will use ContractRenewal, DataPlan,
CustServCalls & RoamMins in our logistic regression model.
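Instead of reading the summary() output by eye, significant predictors can be filtered programmatically from the coefficient table; the toy fit below is simulated to keep the sketch self-contained:

```r
# Simulated logistic fit: only x1 truly influences y (hypothetical data)
set.seed(7)
df   <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- rbinom(200, 1, plogis(1.5 * df$x1))

fit   <- glm(y ~ x1 + x2, data = df, family = binomial)
coefs <- summary(fit)$coefficients            # Estimate, Std. Error, z, p
sig   <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
sig

# The same filter on logit recovers ContractRenewal, DataPlan,
# CustServCalls and RoamMins
```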

Split data into training and test


set.seed(1212)

library(caTools)

split = sample.split(New_Data$Churn, SplitRatio = 0.70)

train = subset(New_Data, split == TRUE)

test = subset(New_Data, split == FALSE)

Modelling on Train Data Set


attach(train)

logit = glm(Churn ~ ContractRenewal + DataPlan + CustServCalls + RoamMins,
            data = train, family = binomial)
summary(logit)

Identifying the overall fitness of the model using the log-likelihood ratio test

library(lmtest)

lrtest(logit)
Coefficient importance
summary(logit)

Explanatory power of odds and probability

exp(coef(logit))

Building Confusion Matrix

prediction <- predict(logit,type = "response")

prediction

cutoff = floor(prediction + 0.5)   # classify as 1 when predicted probability > 0.5
cutoff

confmat = table(Predicted=cutoff,Actual=train$Churn)

confmat
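Before turning to caret, the same metrics can be computed by hand from a 2x2 table of the layout above (rows = predicted, columns = actual); the counts in cm are illustrative, not the model's output:

```r
# Hypothetical confusion matrix with the same layout as confmat above
cm <- matrix(c(1900,  60,
                250, 123),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

TP <- cm["1", "1"]; FP <- cm["1", "0"]
FN <- cm["0", "1"]; TN <- cm["0", "0"]

precision <- TP / (TP + FP)   # of predicted churners, how many actually churned
recall    <- TP / (TP + FN)   # of actual churners, how many were caught
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 3)
```

These match the Precision, Recall and F1 values that confusionMatrix() reports with positive = "1".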

Precision, Recall & F1 Score

install.packages("caret")   # run once, if not already installed

library(caret)

confusionMatrix(confmat,positive="1",mode="everything")

ROC Curve

library(pROC)

plot.roc(train$Churn,prediction)
auc(train$Churn,prediction)

Area under the curve: 0.749

Predicting for Test Data Set


prediction <- predict(logit,test,type = "response")

prediction

cutoff = floor(prediction+0.5)

cutoff

confmat = table(Predicted=cutoff,Actual=test$Churn)

confmat

Precision, Recall & F1 Score

install.packages("caret")

library(caret)

confusionMatrix(confmat,positive="1",mode="everything")
ROC Curve

plot.roc(test$Churn,prediction)
auc(test$Churn,prediction)

Area under the curve: 0.7269
