Group Assignment - Predictive Modelling

The document describes a group project to predict customer churn using linear discriminant analysis. It involves: 1) exploring customer data to understand the impact of variables such as contract renewal and data usage on churn; 2) splitting the data into 70% training and 30% test sets; 3) performing discriminant analysis on the training data to identify the variables that best differentiate churners from non-churners; and 4) using the discriminant functions to classify records and predict churn. Performance is evaluated using a confusion matrix.


GROUP ASSIGNMENT: PREDICTIVE MODELLING

TOPIC: Predicting Customer Churn Based on SVM and Discriminant Analysis

Group Members:
Simran Saha
Srinidhi Narsimhan
Kalai Anbumani
Indushree AnandRaj
Deepak
Using Linear Discriminant Analysis to Predict Customer Churn

1. Reading the data:

mydata=read.csv("GA_Dataset.csv",header=TRUE)
mydata

The dependent variable in the dataset is Churn, which indicates whether a customer has churned: 1 if the customer cancelled the service, 0 if not.

2. Exploring the data:

• Check for Missing Values:

sum(is.na(mydata))

Since the sum of missing values is 0, there are no NAs in the dataset.

• Convert Churn, ContractRenewal and DataPlan variables to factors:

mydata$Churn = as.factor(mydata$Churn)
mydata$ContractRenewal = as.factor(mydata$ContractRenewal)
mydata$DataPlan = as.factor(mydata$DataPlan)
str(mydata)

• Identifying the Baseline Churn Rate:

table(mydata$Churn)
prop.table((table(mydata$Churn)))

The baseline churn rate is 14.5%, which indicates that the dataset is unbalanced: there are far more 0s than 1s.

• Exploratory Data Analysis to determine the impact of each variable on Churn:
• Churn:

ggplot(mydata, aes(x=Churn))+ geom_bar(fill="blue")

The churn rate is 14.5%: 483 of the 3,333 customers have churned.

• AccountWeeks:

summary(mydata$AccountWeeks)
ActWeeks <- cut(mydata$AccountWeeks, breaks = seq(0, 250, by = 25))
ggplot(mydata, aes(ActWeeks, ..count.., fill = Churn)) + geom_bar(position="dodge")
ggplot(mydata, aes(x=Churn, y=AccountWeeks, fill=Churn))+ geom_boxplot()

One would expect churn to decrease as account age (AccountWeeks) increases, but that does not seem to be the case: no clear trend is visible.

• ContractRenewal:

ggplot(mydata, aes(ContractRenewal, ..count.., fill = Churn)) + geom_bar(position="dodge")
ggplot(mydata, aes(x=Churn, y=ContractRenewal, fill=Churn))+ geom_boxplot()


Clearly, there is a substantial probability (approximately 40%) of an account churning if the contract has not been renewed.
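The approximate 40% figure can be verified with a row-wise proportion table (a quick sketch reusing the `mydata` object from above):

```r
# Churn rate within each ContractRenewal level; each row sums to 1
round(prop.table(table(mydata$ContractRenewal, mydata$Churn), margin = 1), 3)
```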

• DataPlan:

ggplot(mydata, aes(DataPlan, ..count.., fill = Churn)) + geom_bar(position="dodge")

ggplot(mydata, aes(x=Churn, y=DataPlan, fill=Churn))+ geom_boxplot()


The probability of an account churning is higher if the account has not subscribed to
a data plan.

• DataUsage:

summary(mydata$DataUsage)
dataUsage <- cut(mydata$DataUsage, include.lowest = TRUE, breaks = seq(0, 5.5, by = 0.5))
ggplot(mydata, aes(dataUsage, ..count.., fill = Churn)) + geom_bar(position="dodge")

ggplot(mydata, aes(x=Churn, y=DataUsage, fill=Churn))+ geom_boxplot()

Clearly, churn is highest in the 0-0.5 data-usage range.


• CustServCalls:

summary(mydata$CustServCalls)
custServCalls <- cut(mydata$CustServCalls, include.lowest = TRUE, breaks = seq(0, 9, by = 1))
ggplot(mydata, aes(custServCalls, ..count.., fill = Churn)) + geom_bar(position="dodge")

ggplot(mydata, aes(x=Churn, y=CustServCalls, fill=Churn))+ geom_boxplot()

The churn rate increases once a customer makes four or more calls to customer service.
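The four-or-more-calls cutoff can be checked numerically (a quick sketch using `mydata` from above):

```r
# Proportion of churners at each customer-service-call count
churnByCalls <- prop.table(table(mydata$CustServCalls, mydata$Churn), margin = 1)
round(churnByCalls[, "1"], 2)
```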

• DayCalls:

summary(mydata$DayCalls)
dayCalls <- cut(mydata$DayCalls, include.lowest = TRUE, breaks = seq(0, 165, by = 16.5))
ggplot(mydata, aes(dayCalls, ..count.., fill = Churn)) + geom_bar(position="dodge")

ggplot(mydata, aes(x=Churn, y=DayCalls, fill=Churn))+ geom_boxplot()

There is no clear churn pattern visible for day calls.

• DayMins:

summary(mydata$DayMins)
dayMins <- cut(mydata$DayMins, include.lowest = TRUE, breaks = seq(0, 385, by = 35))
ggplot(mydata, aes(dayMins, ..count.., fill = Churn)) + geom_bar(position="dodge")
ggplot(mydata, aes(x=Churn, y=DayMins, fill=Churn))+ geom_boxplot()

The churn rate increases when the average monthly daytime minutes exceed 245.

• MonthlyCharge:

summary(mydata$MonthlyCharge)
monthlyCharge <- cut(mydata$MonthlyCharge, include.lowest = TRUE, breaks = seq(14, 114, by = 10))
ggplot(mydata, aes(monthlyCharge, ..count.., fill = Churn)) + geom_bar(position="dodge")
ggplot(mydata, aes(x=Churn, y=MonthlyCharge, fill=Churn))+ geom_boxplot()

The churn rate is highest when the monthly bill is between 64 and 74.

• OverageFee:

summary(mydata$OverageFee)
overageFee <- cut(mydata$OverageFee, include.lowest = TRUE, breaks = seq(0, 19, by = 1.9))
ggplot(mydata, aes(overageFee, ..count.., fill = Churn)) + geom_bar(position="dodge")
ggplot(mydata, aes(x=Churn, y=OverageFee, fill=Churn))+ geom_boxplot()

There is no clear churn pattern visible for Overage Fee.

• RoamMins:

summary(mydata$RoamMins)
roamMins <- cut(mydata$RoamMins, include.lowest = TRUE, breaks = seq(0, 20, by = 2))
ggplot(mydata, aes(roamMins, ..count.., fill = Churn)) + geom_bar(position="dodge")
ggplot(mydata, aes(x=Churn, y=RoamMins, fill=Churn))+ geom_boxplot()

There is no clear churn pattern visible for roaming minutes.

• Correlation between the variables:

library(corrplot)
mydatanumeric = read.csv("GA_Dataset.csv", header=TRUE)  # re-read so that all variables stay numeric
corrplot(cor(mydatanumeric))
DataUsage and DataPlan are highly correlated. MonthlyCharge is also highly correlated with DataUsage, DataPlan and DayMins. Churn is not highly correlated with any single variable; its strongest correlations are with ContractRenewal, CustServCalls and DayMins.
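To pull the highly correlated pairs out programmatically rather than reading them off the plot, a sketch like the following can be used (the 0.7 cutoff is an arbitrary choice):

```r
# List variable pairs whose absolute correlation exceeds 0.7
cm <- cor(mydatanumeric)
cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each pair only once
idx <- which(abs(cm) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           r    = round(cm[idx], 2))
```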

3. Building the model based on the training dataset and test the model performance
using the test dataset.

set.seed(101)
sample <- sample.int(n = nrow(mydata), size = floor(0.70 * nrow(mydata)), replace = FALSE)
train <- mydata[sample, ]
test <- mydata[-sample, ]

The data has been split: the training set has 2,333 observations (70% of the data) and the test set has 1,000.

• Testing the significance of the Discriminant Function Using MANOVA:

1. The null hypothesis of MANOVA is that the means of the independent variables are equal across the groups, which implies that the independent variables do not differentiate the groups.
2. The alternative hypothesis is that at least one independent variable has a different mean across the groups, i.e. it is a significant differentiator.

head(train)
Xm = data.matrix(train[, 2:11])  # data.matrix coerces the factor columns to numeric codes
Ym = as.vector(train[, 1])
Manova=manova(Xm~Ym)
summary(Manova, test = "Wilks")

As we can see, the Wilks' lambda for the MANOVA is close to 1, indicating that the extent of discrimination in the model is relatively low. However, the p-value is highly significant, so the null hypothesis is rejected. This implies that the discriminant model is statistically significant.

• Discriminant Analysis on Training Data (Fisher and Mahalanobis):


cor(data.matrix(train[, 2:11]))  # data.matrix handles the factor columns
X = data.matrix(train[, 2:11])
Y = train[, 1]
library(DiscriMiner)
discPower(X, Y)

Fisher=desDA(X,Y)
Fisher
print(Fisher$power,digits = 4)

According to the discrivar values (the discriminant coefficients), DataUsage has the maximum impact and AccountWeeks the least. The constant and ContractRenewal have large negative coefficients, indicating a strong negative impact on the churn score.

Mahalanobis=linDA(X,Y)
Mahalanobis
Score = Mahalanobis$scores
Score
View(Score)
write.csv(Score, file="C:\\Users\\sisaha\\Documents\\Data Science\\Great Lakes\\PGP-BABI\\Predictive Modeling\\Group Assignment\\groupassignmentmahalanobisscores.csv")

• Classify records based on Discriminant Analysis of the X variables and predict the Y variable for the training data.

library(MASS)
X1 = data.matrix(mydata[, 2:11])  # coerce the factor columns to numeric codes
Y1 = as.vector(mydata[, 1])

Jack = lda(Y1 ~ X1)
Jacknife = table(Actual = mydata$Churn, Predicted = predict(Jack)$class)
Jacknife

sublda=lda(Churn~.,data=train)
sublda

The difference in group means is highest for number of customer service calls and lowest for
accountweeks. This gives us insight into which factors contribute most to the discrimination between
the groups.

The coefficients show a similar pattern and indicate which X variables contribute most to group separation.

plot(sublda, dimen=1, type="b")

The groups created by the discriminant analysis can be seen in the plots, and they are consistent with the Wilks' lambda value of 0.83 from our MANOVA test. The plots are a good indicator that although the model is significant, the two groups are not completely separated; there is some overlap.

• Predictions on the Test Set:

lda.pred=predict(sublda, newdata = test)

library(hmeasure)

class.lda=lda.pred$class
true.class<-test[,1]
lda.counts <- misclassCounts(class.lda,true.class)
lda.counts$conf.matrix

print(lda.counts$metrics,digits=3)

• Evaluate Model Performance Measures:

The accuracy of the model is 1 - error rate:

Errorrate=0.158
Accuracyofldamodel = 1-Errorrate
Accuracyofldamodel

Accuracy indicates how many of the model's predictions are correct. The model has fairly good accuracy. However, since the dataset is unbalanced, accuracy alone is not a reliable indicator of a robust model, so we will also look at a few other performance measures.

Sensitivity, also called the true positive rate, is the proportion of actual positives that are correctly identified by the model. The sensitivity of this model is 22.5%, which is relatively low. For a churn-prediction model it is important to pick up positives as positives, i.e. to accurately identify the customers who will churn, and that is exactly what sensitivity measures.
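As a reminder of how these metrics fall out of the confusion matrix, here is a small sketch with placeholder counts (chosen to be consistent with the reported 0.158 error rate and 22.5% sensitivity, not the model's actual confusion matrix):

```r
# TP, FN, FP, TN are illustrative placeholder counts, not real model output
TP <- 32; FN <- 110; FP <- 48; TN <- 810
sensitivity <- TP / (TP + FN)                  # true positive rate
specificity <- TN / (TN + FP)                  # true negative rate
accuracy    <- (TP + TN) / (TP + FN + FP + TN) # overall correct rate
c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```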

We will now vary the threshold of the model from the default 50% to other values to decide on an
optimum balance for sensitivity and specificity.

• Classifying with a default threshold of 0.5:

lda.pred$posterior[1:3,]

scores.lda <- lda.pred$posterior[,2]
all((scores.lda > 0.5) == (class.lda == "1"))

The model, by default, uses a 50% threshold to classify records as 0 or 1.

Threshold of 30%
lda.counts.T03 <- misclassCounts(scores.lda>0.3,true.class)
lda.counts.T03$conf.matrix

Threshold of 20%
lda.counts.T02 <- misclassCounts(scores.lda>0.2,true.class)
lda.counts.T02$conf.matrix

Threshold of 17%
lda.counts.T017 <- misclassCounts(scores.lda>0.17,true.class)
lda.counts.T017$conf.matrix

Threshold of 16%
lda.counts.T016 <- misclassCounts(scores.lda>0.16,true.class)
lda.counts.T016$conf.matrix

Threshold of 15%
lda.counts.T015 <- misclassCounts(scores.lda>0.15,true.class)
lda.counts.T015$conf.matrix

Threshold of 10%
lda.counts.T01 <- misclassCounts(scores.lda>0.1,true.class)
lda.counts.T01$conf.matrix

Now, let's compare the error rate, sensitivity and specificity for five threshold values.

lda.counts.T02$metrics[c('ER', 'Sens','Spec')]
lda.counts.T017$metrics[c('ER', 'Sens','Spec')]

lda.counts.T016$metrics[c('ER', 'Sens','Spec')]

lda.counts.T015$metrics[c('ER', 'Sens','Spec')]

lda.counts.T01$metrics[c('ER', 'Sens','Spec')]
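The repeated calls above can also be collapsed into a single sweep over a grid of thresholds (a sketch, assuming `scores.lda` and `true.class` from the earlier steps):

```r
# Error rate, sensitivity and specificity across several thresholds
thresholds <- c(0.5, 0.3, 0.2, 0.17, 0.16, 0.15, 0.1)
metricsAt <- function(th)
  unlist(misclassCounts(scores.lda > th, true.class)$metrics[c('ER', 'Sens', 'Spec')])
sweep <- t(sapply(thresholds, metricsAt))
data.frame(threshold = thresholds, sweep)
```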

Based on the discriminant coefficients and the correl_ratio values provided by the model, an increase in the following variables increases the probability of customer churn:

• Number of customer service calls
• Total day charge

These insights from the discriminant model can help the business formulate strategies to reduce customer churn. Based on what we have learned, we would recommend the following: first, customer issues should be resolved within the first or second call, as repeated calls to customer service cause customer churn. Second, there should be an organized escalation procedure for issues not resolved within two calls. Lastly, the provider should offer more attractive plans that reduce the cost of day, evening and international calls based on usage.

Using SVM to Predict Customer Churn

1. Building the model based on the training dataset and testing the model performance using the test dataset:

names(train)

attach(train)
library(e1071)
library(ROCR)
library(pROC)
SVMModel = svm(Churn ~ AccountWeeks + ContractRenewal + DataPlan + DataUsage + CustServCalls + DayMins + DayCalls + MonthlyCharge + OverageFee + RoamMins, data = train, kernel = "radial", scale = FALSE)

prediction=predict(SVMModel)
prediction
table(Actual=train$Churn,Predicted=prediction)
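The training confusion table above can be summarized into a single accuracy number (a quick sketch reusing the `prediction` object from the previous step):

```r
# Overall training accuracy: correct predictions / all predictions
cmSVM <- table(Actual = train$Churn, Predicted = prediction)
sum(diag(cmSVM)) / sum(cmSVM)
```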

• Checking accuracy of the model on the training data set:

train$predSVM <- predict(SVMModel, train)
train$predSVM

# ROCR's prediction() needs numeric scores, so coerce the factor predictions;
# for a smoother ROC curve one could refit with probability = TRUE and use
# the predicted class probabilities instead.
ROCRpredSVM <- prediction(as.numeric(train$predSVM), train$Churn)
ROCRperfSVM <- performance(ROCRpredSVM, 'tpr', 'fpr')
plot(ROCRperfSVM)

aucSVM <- performance(ROCRpredSVM, measure = "auc")
aucSVM <- aucSVM@y.values[[1]]
aucSVM

• Checking accuracy of the model on the test data set:

test$predSVM <- predict(SVMModel, test)
test$predSVM

ROCRpredSVMtest <- prediction(as.numeric(test$predSVM), test$Churn)
ROCRperfSVMtest <- performance(ROCRpredSVMtest, 'tpr', 'fpr')
plot(ROCRperfSVMtest)

aucSVMtest <- performance(ROCRpredSVMtest, measure = "auc")
aucSVMtest <- aucSVMtest@y.values[[1]]
aucSVMtest
