DM Group Assignment
Churn Model – Logistic Regression
Submitted by: Prajakta Vast, Surbhi Rane, Kritima Upadhyay, Supriya Kadam, Pratyush Chodhary
Table of Contents:
1. Selection of the response variable and the predictor variables
2. Exploratory data analysis
3. Splitting the data into training (70%) and testing (30%) sets
4. Variable selection using various methods
5. Analysis of the model's accuracy
The dataset was provided without indicating which variables are
independent and which is dependent, so we take Churn (a categorical
variable) as the response, Y.
Data Preparation:
Note: All the work was done in RStudio and Tableau.
> setwd("...")  # set to the folder containing GA_Dataset.csv
> getwd()
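Data <- read.csv("GA_Dataset.csv")
dim(Data)    # number of rows and columns
str(Data)    # variable types
head(Data)   # first six rows
# Treat the binary indicators as categorical
Data$Churn <- as.factor(Data$Churn)
Data$ContractRenewal <- as.factor(Data$ContractRenewal)
describe(Data)   # describe() comes from an add-on package such as Hmisc or psych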
Since n is 3333 for all 11 variables, there are no missing values.
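The same conclusion can be checked directly by counting NA values per column. A minimal sketch in base R, using a small made-up data frame (`toy` is hypothetical and is not the assignment data):

```r
# Hypothetical toy data frame; not the GA_Dataset.csv data.
toy <- data.frame(
  Churn         = c(0, 1, 0, 0),
  DataUsage     = c(2.7, NA, 0.0, 1.3),
  CustServCalls = c(1, 4, 0, 2)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# TRUE only if no column has any missing value
no_missing <- all(na_per_column == 0)
no_missing
```

Applied to the assignment data, the call would be colSums(is.na(Data)); a result of all zeros would agree with the n = 3333 reading above.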
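summary(Data)
library(tidyverse)   # provides %>%, keep(), gather() and ggplot()
Data %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +   # one histogram panel per variable
  geom_histogram()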
Customer Churn
table(Data$Churn)
0 1
2850 483
round(prop.table(table(Data$Churn)),3)
0 1
0.855 0.145
g1
We can see that 483 of the 3333 customers (14.5%) cancelled service.
The data set is therefore imbalanced: the 0s and 1s in the dependent
variable occur in highly unequal proportions.
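One practical consequence of this imbalance is that accuracy alone can look good for a useless model. The sketch below uses only the class counts reported above and computes the accuracy of a trivial rule that always predicts "no churn"; the variable names are illustrative.

```r
# Class counts taken from table(Data$Churn) above
churn_counts <- c("0" = 2850, "1" = 483)

# Share of each class in the data
class_share <- churn_counts / sum(churn_counts)

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy <- max(class_share)
round(baseline_accuracy, 3)
```

Any fitted model should be compared against this baseline, and measures such as sensitivity and specificity are more informative than raw accuracy here.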
table(Data$ContractRenewal)
0 1
323 3010
round(prop.table(table(Data$ContractRenewal)),3)
0 1
0.097 0.903
g2
9.7% of the customers did not renew their contract.
table(Data$DataPlan)
0 1
2411 922
round(prop.table(table(Data$DataPlan)),3)
0 1
0.723 0.277
g3
Only 27.7% of the customers have a data plan; the remaining 72.3% do
not.
Data Usage
summary(Data$DataUsage)
boxplot(Data$DataUsage~Data$Churn,horizontal = T,
        main="Customer Churn vs Data Usage",
        xlab="Data Usage",ylab="Customer Churn")
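For churned customers the box sits at the low end of the scale (the
distribution is concentrated near zero, i.e. right-skewed), which
suggests that most customers who churned used little or no data or
did not have a data plan.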
CustServCalls
summary(Data$CustServCalls)
g6
The churn rate is higher among customers who make more than 4 calls to
customer service.
boxplot(Data$CustServCalls~Data$Churn,horizontal = T,
        main="Customer Churn vs Customer Service Calls",
        xlab="Customer Service Calls",ylab="Customer Churn")
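The statement that churn is higher above 4 service calls can be checked numerically by comparing churn rates in the two groups. A minimal sketch in base R, on made-up vectors (`calls` and `churn` are hypothetical, not the assignment data):

```r
# Hypothetical toy data; not the assignment data.
calls <- c(0, 1, 2, 5, 6, 1, 3, 7, 4, 5)
churn <- c(0, 0, 0, 1, 1, 0, 0, 1, 0, 0)

# Group customers by whether they made more than 4 service calls
many_calls <- calls > 4

# The mean of a 0/1 variable within each group is that group's churn rate
churn_rate <- tapply(churn, many_calls, mean)
churn_rate
```

On the assignment data the analogous call would be tapply(as.numeric(as.character(Data$Churn)), Data$CustServCalls > 4, mean), since Churn has been converted to a factor.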
MonthlyCharge
monthlycharge <- cut(Data$MonthlyCharge, include.lowest = TRUE,
breaks = seq(14, 114, by = 10))
g9
Overage Fee
g10
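library(corrplot)
# Data_Num holds the variables in numeric form; the first column is dropped here
Cor_Matrix <- round(cor(Data_Num[, -1]), 2)
Cor_Matrix
corrplot(cor(Data_Num), type = "lower", order = "hclust", method = "color",
         bg = "black", title = "Correlation Plot")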
DataPlan and DataUsage are highly correlated (0.95).
DataPlan and MonthlyCharge are highly correlated (0.74).
DataUsage and MonthlyCharge are highly correlated (0.78).
DayMins and MonthlyCharge are moderately correlated (0.57).
Churn does not appear to be highly correlated with any of the
variables.
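Pairs like these can also be pulled out of a correlation matrix programmatically instead of being read off the plot. The sketch below runs on simulated variables (`x1`, `x2`, `x3` are hypothetical), and the 0.7 threshold for "highly correlated" is an assumption chosen for illustration, not a value stated in the report.

```r
set.seed(1)
n <- 200
# Hypothetical toy variables; x2 is built to track x1 closely
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

cm <- cor(toy)
# Keep each pair once by blanking the diagonal and the upper triangle
cm[upper.tri(cm, diag = TRUE)] <- NA

# Row/column positions of the pairs whose |r| exceeds the assumed threshold
idx <- which(abs(cm) > 0.7, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = round(cm[idx], 2)
)
high_pairs
```

Strongly correlated predictors carry overlapping information, which is one reason to avoid putting all of them into the same logistic regression.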
attach(Data)
table(Churn,ContractRenewal)
     ContractRenewal
Churn    0    1
    0  186 2664
    1  137  346
chisq.test(table(Churn,ContractRenewal))
table(Churn,DataPlan)
     DataPlan
Churn    0    1
    0 2008  842
    1  403   80
chisq.test(table(Churn,DataPlan))
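The chi-square test compares the observed counts with the counts expected if churn were independent of the other variable. The sketch below rebuilds the Churn by ContractRenewal table from the counts printed above, so it runs without the data file; the object names are illustrative.

```r
# Counts copied from table(Churn, ContractRenewal) above
renewal_tab <- matrix(
  c(186, 2664,
    137,  346),
  nrow = 2, byrow = TRUE,
  dimnames = list(Churn = c("0", "1"), ContractRenewal = c("0", "1"))
)

test_result <- chisq.test(renewal_tab)

# Expected counts if churn and contract renewal were independent
round(test_result$expected, 1)

# A small p-value is evidence against independence
test_result$p.value
```

The observed count of churners who did not renew (137) is far above its expected value under independence, which is what drives the test statistic.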
library(tibble)
# .before expects a column name (as a string) or a column position
New_Data <- add_column(Data_Num.scale, Churn, .before = "AccountWeeks")
summary(logit)
From the above logistic regression, ContractRenewal, DataPlan,
CustServCalls and RoamMins are significant predictors of customer
churn. Hence we retain these four variables in our logistic
regression model.
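The model below is fitted on a training set, but the code that creates the 70%/30% split is not shown. As an assumption about how such a split could be done, here is a minimal stratified split in base R on made-up data (`toy`, the seed, and all object names are hypothetical):

```r
set.seed(123)  # arbitrary seed, chosen only to make the sketch reproducible

# Hypothetical toy data with a class imbalance similar to the report's
toy <- data.frame(
  Churn = factor(rep(c(0, 1), times = c(850, 150))),
  x     = rnorm(1000)
)

# Sample 70% of the row indices within each class, so that both sets
# keep the original churn proportion
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$Churn),
  function(rows) sample(rows, size = round(0.7 * length(rows)))
))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

prop.table(table(train_toy$Churn))
prop.table(table(test_toy$Churn))
```

Splitting within each class matters for imbalanced data, because a purely random split can leave the smaller set with too few churners to evaluate the model reliably.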
library(caTools)
logit=glm(Churn ~ ContractRenewal+DataPlan+CustServCalls+RoamMins,
data = train[,-1],family=binomial)
summary(logit)
library(lmtest)
lrtest(logit)
Coefficients importance
summary(logit)
exp(coef(logit))
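exp(coef(logit)) converts the log-odds coefficients into odds ratios. To make that concrete, the sketch below uses only the Churn by ContractRenewal counts printed earlier: it computes the odds ratio by hand and shows that a logistic regression with ContractRenewal as the single predictor reproduces it. This is a one-predictor illustration, so its value will differ from the coefficient in the four-predictor model above.

```r
# Counts copied from table(Churn, ContractRenewal) earlier in the report
tab <- data.frame(
  ContractRenewal = factor(c(0, 1, 0, 1)),
  Churn           = c(0, 0, 1, 1),
  n               = c(186, 2664, 137, 346)
)

# Odds of churn = churners / non-churners within each group
odds_no_renewal <- 137 / 186
odds_renewal    <- 346 / 2664
odds_ratio_hand <- odds_renewal / odds_no_renewal

# Same quantity from a logistic regression fitted to the grouped counts
fit <- glm(Churn ~ ContractRenewal, data = tab, weights = n, family = binomial)
odds_ratio_glm <- exp(coef(fit))[["ContractRenewal1"]]

c(by_hand = odds_ratio_hand, from_glm = odds_ratio_glm)
```

An odds ratio below 1 means the odds of churn are lower in that group: here, customers who renewed have roughly 0.18 times the odds of churning compared with those who did not.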
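# Predicted churn probabilities on the training set
prediction = predict(logit, newdata = train, type = "response")
prediction
# Classify with a 0.5 cutoff: floor(p + 0.5) is 1 when p >= 0.5, else 0
cutoff = floor(prediction+0.5)
cutoff
confmat = table(Predicted=cutoff,Actual=train$Churn)
confmat
# install.packages("caret")   # run once if caret is not installed
library(caret)
confusionMatrix(confmat,positive="1",mode="everything")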
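The expression floor(prediction+0.5) rounds each probability to the nearest integer, which amounts to classifying with a 0.5 cutoff. The sketch below checks that equivalence and computes sensitivity, specificity and accuracy by hand in base R. The probabilities and labels are made up for illustration and are not model output from the report.

```r
# Hypothetical predicted probabilities and true labels (not from the report)
prob   <- c(0.10, 0.49, 0.50, 0.80, 0.30, 0.95, 0.20, 0.60)
actual <- c(0,    0,    1,    1,    1,    1,    0,    0)

# floor(p + 0.5) gives the same classes as an explicit 0.5 threshold
pred_floor     <- as.integer(floor(prob + 0.5))
pred_threshold <- as.integer(prob >= 0.5)
same_classes   <- identical(pred_floor, pred_threshold)

# Cells of the confusion matrix, with 1 (churn) as the positive class
tp <- sum(pred_threshold == 1 & actual == 1)
fn <- sum(pred_threshold == 0 & actual == 1)
fp <- sum(pred_threshold == 1 & actual == 0)
tn <- sum(pred_threshold == 0 & actual == 0)

sensitivity <- tp / (tp + fn)   # share of actual churners that were caught
specificity <- tn / (tn + fp)   # share of non-churners correctly cleared
accuracy    <- (tp + tn) / length(actual)

c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```

With imbalanced classes a 0.5 cutoff tends to favour the majority class, so it is worth looking at sensitivity for the churn class rather than accuracy alone.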
ROC Curve
library(pROC)
plot.roc(train$Churn,prediction)
auc(train$Churn,prediction)
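The AUC can be read as the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. As a sketch of that interpretation, the code below computes it with the rank-sum (Mann-Whitney) formula in base R, without pROC. The scores and labels are hypothetical.

```r
# Hypothetical scores and labels (not from the report)
score  <- c(0.10, 0.40, 0.35, 0.80, 0.70, 0.20)
actual <- c(0,    0,    1,    1,    1,    0)

n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)

# Rank-sum form of the AUC; rank() assigns average ranks to ties
r <- rank(score)
auc_rank <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Direct check: share of (churner, non-churner) pairs ranked correctly
pairs_correct <- mean(outer(score[actual == 1], score[actual == 0], ">"))

c(auc_rank = auc_rank, pairs_correct = pairs_correct)
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation; unlike accuracy, it does not depend on the choice of cutoff.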
# Predicted churn probabilities on the test set
prediction = predict(logit, newdata = test, type = "response")
prediction
cutoff = floor(prediction+0.5)
cutoff
confmat = table(Predicted=cutoff,Actual=test$Churn)
confmat
confusionMatrix(confmat,positive="1",mode="everything")
ROC Curve
plot.roc(test$Churn,prediction)
auc(test$Churn,prediction)