DM Group Assignment

The document summarizes the steps taken to build a logistic regression model to predict customer churn using a telecom dataset. Key steps included: 1. Exploratory data analysis including visualizations of categorical and continuous variables to identify predictors and response variable. 2. Splitting the data into 70% training and 30% test sets. 3. Building a logistic regression model on the training set with significant predictors identified from chi-square tests and scaling. 4. Evaluating the model's performance on both training and test sets using metrics like AUC, confusion matrix, precision, recall and F1 score.


Predictive Modelling

Group Assignment
Churn Model – Logistic Regression

Submitted by: Prajakta Vast, Surbhi Rane, Kritima Upadhyay, Supriya Kadam, Pratyush Chodhary
Table of Contents:-
1. Selection of the response variable and the predictor variables
2. Exploratory Data Analysis
3. Splitting the data into training (70%) and test (30%) sets
4. Variable selection using various methods
5. Analysing the accuracy of the model

1. Selection of Response and Predictor Variables.


The objective of this assignment is to apply the learnings of Predictive
Modelling to build a logistic regression model that predicts which
customers will cancel their service.

Logistic regression is a predictive modelling algorithm that is used when
the Y variable is binary categorical, i.e. it can take only two values, such
as 1 or 0. The goal is to determine a mathematical equation that can be used
to predict the probability of event 1. Once the equation is established, it
can be used to predict Y when only the X's are known.
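This mapping from a linear score to a probability can be sketched in a few lines of R; the coefficients b0 and b1 below are illustrative placeholders, not values fitted on the assignment dataset:

```r
# Logistic (sigmoid) function: maps any real-valued score to a probability in (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Illustrative coefficients (hypothetical, not fitted on the churn data)
b0 <- -2.0   # intercept
b1 <-  0.8   # slope for a single predictor x

x <- 3
p <- logistic(b0 + b1 * x)   # estimated P(Y = 1 | x)
p
```

In the actual model below, glm() with family = binomial estimates such coefficients by maximum likelihood.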

Here, the dataset was given without indicating the independent and
dependent variables, so considering the output as the Churn Rate
(Categorical Variable) i.e. Y.

Data Preparation:-
Note: All the working has been done using R Studio & Tableau.

## Let us first set the working directory path

> setwd("<path to working directory>")
> getwd()

## Import the data file

Data<-read.csv("GA_Dataset.csv")

2. Exploratory Data Analysis


Let us first study the structure of the dataset provided.

dim(Data)
str(Data)

head(Data)

A few points to note:-

 The dataset consists of 3333 observations & 11 variables
 All of our variables are in integer or numeric format
 To decide which variable should be treated as Y and which as the
independent variables, we ran the head command to inspect the first few
rows of the data
 Churn is our dependent variable & the remaining 10 are our
independent variables
 We will convert Churn, ContractRenewal & DataPlan to factors for EDA
purposes only

Data$Churn<-as.factor(Data$Churn)

Churn is a categorical variable with 1 & 0

Data$ContractRenewal<-as.factor(Data$ContractRenewal)

ContractRenewal is a categorical variable with 1 & 0


Data$DataPlan<-as.factor(Data$DataPlan)

DataPlan is a categorical variable with 1 & 0

library(psych)   # describe() assumed to come from the psych package
describe(Data)

Since n is 3333 for all 11 variables, there are no missing values.
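The describe() check can be confirmed directly with colSums(is.na(...)); the toy data frame below is hypothetical and only illustrates the idiom:

```r
# Toy data frame with one deliberate NA (hypothetical, not the churn data)
toy <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))

# Per-column count of missing values
colSums(is.na(toy))   # a = 1, b = 0

# Applied to the assignment data, colSums(is.na(Data)) returns all zeros
```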

summary(Data)

Histogram for all the Numerical Variables.

library(tidyverse)   # %>%, keep(), gather(), ggplot()
Data %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

A few insights from the above histogram:-

 The median of DataUsage is 0 GB, which means 50% of the customers do
not use the internet at all
 The majority of customers use less than 0.5 GB of monthly data

Customer Churn

table(Data$Churn)

0 1
2850 483

round(prop.table(table(Data$Churn)),3)

0 1
0.855 0.145

g1 <- ggplot(Data, aes(Churn, ..count..)) +
  geom_bar(aes(fill = Churn), position = "dodge") +
  ggtitle("Customer Churn") +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -0.5, hjust = 1) +
  geom_text(stat = 'count',
            aes(label = paste0("(", round(..count../sum(..count..)*100), "%)")),
            vjust = -0.5, hjust = -0.3) +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual("Customer Churn", values = c("darkorange", "darkorchid"))

g1

 We can see that 483 of 3333 customers cancelled the service, i.e. about
14.5% of customers churned. The dataset is therefore unbalanced: the 0s
and 1s in the dependent variable are highly unequal.
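One practical consequence of this imbalance, using the class counts from the table above: a trivial model that always predicts "no churn" already reaches 85.5% accuracy, which is why precision, recall and F1 are examined later rather than accuracy alone.

```r
# Class counts from table(Data$Churn)
n_stay  <- 2850
n_churn <- 483

# Accuracy of always predicting the majority class (0 = no churn)
baseline_accuracy <- n_stay / (n_stay + n_churn)
round(baseline_accuracy, 3)   # 0.855
```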

Customer & Contract Renewal

table(Data$ContractRenewal)
0 1

323 3010

round(prop.table(table(Data$ContractRenewal)),3)

0 1
0.097 0.903

g2 <- ggplot(Data, aes(ContractRenewal, ..count..)) +
  geom_bar(aes(fill = ContractRenewal), position = "dodge") +
  ggtitle("Customer & Contract Renewal") +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -0.5, hjust = 1) +
  geom_text(stat = 'count',
            aes(label = paste0("(", round(..count../sum(..count..)*100), "%)")),
            vjust = -0.5, hjust = -0.3) +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual("Customer & Contract Renewal",
                    values = c("darkorange", "darkorchid"))

g2
 9.7% of the customers did not Renew the Contract

Customer & Data Plan

table(Data$DataPlan)

0 1
2411 922

round(prop.table(table(Data$DataPlan)),3)

0 1
0.723 0.277

g3 <- ggplot(Data, aes(DataPlan, ..count..)) +
  geom_bar(aes(fill = DataPlan), position = "dodge") +
  ggtitle("Customer & Data Plan") +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -0.5, hjust = 1) +
  geom_text(stat = 'count',
            aes(label = paste0("(", round(..count../sum(..count..)*100), "%)")),
            vjust = -0.5, hjust = -0.3) +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual("Customer & Data Plan",
                    values = c("darkorange", "darkorchid"))

g3

 Only 27.7% of the customers have a data plan; the remaining 72.3% do not.
Data Usage

summary(Data$DataUsage)

Min. 1st Qu. Median Mean 3rd Qu. Max.


0.0000 0.0000 0.0000 0.8165 1.7800 5.4000

dataUsage <- cut(Data$DataUsage, include.lowest = TRUE,
                 breaks = seq(0, 5.5, by = 0.5))

g5 <- ggplot(Data, aes(dataUsage, ..count.., fill = Churn)) +
  geom_bar(position = "dodge")
g5

 Churning is maximum in the 0-0.5 GB data-usage category.

boxplot(Data$DataUsage~Data$Churn,horizontal = T,main="Customer
Churn vs Data Usage",xlab="Data Usage",ylab="Customer Churn")
 The distribution for churned customers is concentrated at zero, which
means that customers who churned either didn't use data or didn't have a
data plan.

CustServCalls

summary(Data$CustServCalls)

Min. 1st Qu. Median Mean 3rd Qu. Max.


0.000 1.000 1.000 1.563 2.000 9.000

g6 <- ggplot(Data, aes(CustServCalls, ..count.., fill = Churn)) +
  geom_bar(position = "dodge")

g6
 Churn Rate is higher if a customer makes more than 4 calls to customer
service.
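The pattern in the bar chart can be quantified as a churn rate per call count using row-wise proportions; the mini-sample below is hypothetical, with the equivalent call on the full dataset shown as a comment:

```r
# Hypothetical mini-sample: service calls and churn flag per customer
calls <- c(0, 1, 1, 2, 4, 5, 5, 5)
churn <- c(0, 0, 0, 0, 1, 1, 1, 0)

# margin = 1 normalises each row, giving the churn rate at each call count
churn_by_calls <- prop.table(table(calls, churn), margin = 1)
round(churn_by_calls, 2)

# On the assignment data:
# prop.table(table(Data$CustServCalls, Data$Churn), margin = 1)
```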

BoxPlot with respective Churn

boxplot(Data$CustServCalls~Data$Churn,horizontal = T,main="Customer
Churn vs Customer Service Calls",xlab="Customer Service
Calls",ylab="Customer Churn")

 Customers who churned had a median of 2 customer service calls
 The boxplot is right skewed, which means customers who churned made
more customer service calls

MonthlyCharge
monthlycharge <- cut(Data$MonthlyCharge, include.lowest = TRUE,
                     breaks = seq(14, 114, by = 10))

g9 <- ggplot(Data, aes(monthlycharge, ..count.., fill = Churn)) +
  geom_bar(position = "dodge")

g9

 Churn rate is highest when the monthly bill is between 64 and 74.

Overage Fee

overagefee <- cut(Data$OverageFee, include.lowest = TRUE,
                  breaks = seq(0, 19, by = 1.9))

g10 <- ggplot(Data, aes(overagefee, ..count.., fill = Churn)) +
  geom_bar(position = "dodge")

g10

 No clear pattern is visible from Overage Fee


Checking for Correlation between variables
Data_Num<-read.csv("c:/Kritima PGP-BABI/GA_Dataset.csv")

Cor_Matrix<-round(cor(Data_Num[,-1]),2)

Cor_Matrix

corrplot(cor(Data_Num),type="lower",order="hclust",method =
"color",bg="black",title = "Correlation Plot")
 DataPlan & DataUsage are Highly correlated (0.95)
 DataPlan & Monthly charge highly correlated (0.74)
 DataUsage & Monthly charge highly correlated (0.78)
 DayMins & Monthly charge moderately correlated (0.57)
 Churn does not seem to be highly correlated with any of the variables

In order to determine significant categorical variables, we perform a
chi-square test:

attach(Data)

table(Churn,ContractRenewal)

ContractRenewal
Churn 0 1
0 186 2664
1 137 346

chisq.test(table(Churn,ContractRenewal))

Pearson's Chi-squared test with Yates' continuity correction

data: table(Churn, ContractRenewal)
X-squared = 222.57, df = 1, p-value < 2.2e-16

 p-value less than 0.05, hence ContractRenewal is a significant predictor

table(Churn,DataPlan)

DataPlan
Churn 0 1
0 2008 842
1 403 80

chisq.test(table(Churn,DataPlan))

 p-value less than 0.05, hence DataPlan is a significant predictor
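Rather than testing each categorical predictor one at a time, the p-values can be collected in a single sapply() pass; the factors below are simulated stand-ins, not the dataset's columns:

```r
# Simulated factors standing in for the categorical predictors (hypothetical)
set.seed(1)
y  <- factor(rbinom(300, 1, 0.3))   # response stand-in
x1 <- factor(rbinom(300, 1, 0.5))   # predictor stand-in 1
x2 <- factor(rbinom(300, 1, 0.3))   # predictor stand-in 2

# Chi-square p-value of each predictor's association with y
pvals <- sapply(list(ContractRenewal = x1, DataPlan = x2),
                function(x) chisq.test(table(y, x))$p.value)
pvals   # predictors with p < 0.05 would be kept
```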

Scaling Data Set


Data_Num.scale <- as.data.frame(scale(Data_Num[,-1]))

library(tibble)   # for add_column()
New_Data <- add_column(Data_Num.scale, Churn, .before = "AccountWeeks")

logit = glm(Churn ~ ., data = New_Data, family = binomial)

summary(logit)
 From the above logistic regression, ContractRenewal, DataPlan,
CustServCalls & RoamMins are significant predictors of customer
churn. Hence we will use ContractRenewal, DataPlan,
CustServCalls & RoamMins in our logistic regression model.
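Instead of reading the summary() output by eye, significant predictors can be filtered programmatically from the coefficient table; the toy fit below is simulated to keep the sketch self-contained:

```r
# Simulated logistic fit: only x1 truly influences y (hypothetical data)
set.seed(7)
df   <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- rbinom(200, 1, plogis(1.5 * df$x1))

fit   <- glm(y ~ x1 + x2, data = df, family = binomial)
coefs <- summary(fit)$coefficients            # Estimate, Std. Error, z, p
sig   <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
sig

# The same filter on logit recovers ContractRenewal, DataPlan,
# CustServCalls and RoamMins
```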

Split data into training and test


set.seed(1212)

library(caTools)

split = sample.split(New_Data$Churn, SplitRatio = 0.70)

train = subset(New_Data, split == TRUE)

test = subset(New_Data, split == FALSE)

Modelling on Train Data Set


attach(train)

logit = glm(Churn ~ ContractRenewal + DataPlan + CustServCalls + RoamMins,
            data = train, family = binomial)
summary(logit)

Identifying the overall fitness of the model using the log-likelihood ratio test

library(lmtest)

lrtest(logit)
Coefficient importance
summary(logit)

Explanatory power of odds and probability

exp(coef(logit))

Building Confusion Matrix

prediction <- predict(logit,type = "response")

prediction

cutoff = floor(prediction + 0.5)   # classify as 1 when predicted probability > 0.5
cutoff

confmat = table(Predicted=cutoff,Actual=train$Churn)

confmat
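Before turning to caret, the same metrics can be computed by hand from a 2x2 table of the layout above (rows = predicted, columns = actual); the counts in cm are illustrative, not the model's output:

```r
# Hypothetical confusion matrix with the same layout as confmat above
cm <- matrix(c(1900,  60,
                250, 123),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

TP <- cm["1", "1"]; FP <- cm["1", "0"]
FN <- cm["0", "1"]; TN <- cm["0", "0"]

precision <- TP / (TP + FP)   # of predicted churners, how many actually churned
recall    <- TP / (TP + FN)   # of actual churners, how many were caught
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 3)
```

These match the Precision, Recall and F1 values that confusionMatrix() reports with positive = "1".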

Precision, Recall & F1 Score

install.packages("caret")   # run once, if not already installed

library(caret)

confusionMatrix(confmat,positive="1",mode="everything")

ROC Curve

library(pROC)

plot.roc(train$Churn,prediction)
auc(train$Churn,prediction)

Area under the curve: 0.749

Predicting for Test Data Set


prediction <- predict(logit,test,type = "response")

prediction

cutoff = floor(prediction+0.5)

cutoff

confmat = table(Predicted=cutoff,Actual=test$Churn)

confmat

Precision, Recall & F1 Score

install.packages("caret")

library(caret)

confusionMatrix(confmat,positive="1",mode="everything")
ROC Curve

plot.roc(test$Churn,prediction)
auc(test$Churn,prediction)

Area under the curve: 0.7269
