Group Assignment - Predictive Modelling
Group Members:
Simran Saha
Srinidhi Narsimhan
Kalai Anbumani
Indushree AnandRaj
Deepak
Using Linear Discriminant Analysis to Predict Customer Churn
mydata=read.csv("GA_Dataset.csv",header=TRUE)
mydata
The dependent variable in the dataset is Churn, which indicates whether a customer has cancelled the service: 1 if the customer cancelled, 0 if not.
mydata$Churn = as.factor(mydata$Churn)
mydata$ContractRenewal = as.factor(mydata$ContractRenewal)
mydata$DataPlan = as.factor(mydata$DataPlan)
str(mydata)
table(mydata$Churn)
prop.table((table(mydata$Churn)))
AccountWeeks:
summary(mydata$AccountWeeks)
library(ggplot2)   # needed for the plots in this section
ActWeeks <- cut(mydata$AccountWeeks, breaks = seq(0, 250, by = 25))
ggplot(mydata, aes(ActWeeks, ..count.., fill = Churn)) + geom_bar(position="dodge")
ggplot(mydata, aes(x=Churn, y=AccountWeeks, fill=Churn))+ geom_boxplot()
One would expect the churn rate to decrease as account tenure (account weeks) increases, but that does not appear to be the case: no clear trend is visible.
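To back this observation with a number, the churn rate in each AccountWeeks bin can be tabulated directly (a minimal base-R sketch using the ActWeeks bins created above):
# Churn rate per AccountWeeks bin; a roughly flat pattern supports the "no clear trend" reading
tapply(mydata$Churn == "1", ActWeeks, mean)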
ContractRenewal:
DataPlan:
DataUsage:
summary(mydata$DataUsage)
dataUsage <- cut(mydata$DataUsage, include.lowest = TRUE, breaks = seq(0, 5.5, by = 0.5))
ggplot(mydata, aes(dataUsage, ..count.., fill = Churn)) + geom_bar(position="dodge")
CustServCalls:
summary(mydata$CustServCalls)
custServCalls <- cut(mydata$CustServCalls, include.lowest = TRUE, breaks = seq(0, 9, by = 1))
ggplot(mydata, aes(custServCalls, ..count.., fill = Churn)) + geom_bar(position="dodge")
The churn rate increases if a customer makes 4 or more calls to customer service.
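The same point can be checked numerically with a row-wise proportion table (a small base-R sketch):
# Churn rate for each number of customer service calls
round(prop.table(table(Calls = mydata$CustServCalls, Churn = mydata$Churn), margin = 1), 3)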
DayCalls:
summary(mydata$DayCalls)
dayCalls <- cut(mydata$DayCalls, include.lowest = TRUE, breaks = seq(0, 165, by = 16.5))
ggplot(mydata, aes(dayCalls, ..count.., fill = Churn)) + geom_bar(position="dodge")
DayMins:
summary(mydata$DayMins)
dayMins <- cut(mydata$DayMins, include.lowest = TRUE, breaks = seq(0, 385, by = 35))
ggplot(mydata, aes(dayMins, ..count.., fill =Churn)) + geom_bar(position="dodge")
ggplot(mydata, aes(x=Churn, y=DayMins, fill=Churn))+ geom_boxplot()
The churn rate increases if the monthly average daytime minutes are greater than
245.
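A quick way to verify the 245-minute cut-off suggested by the plots is to compare churn rates on either side of it (base-R sketch; 245 is the value quoted above):
# Churn rate for DayMins <= 245 (FALSE) vs. DayMins > 245 (TRUE)
tapply(mydata$Churn == "1", mydata$DayMins > 245, mean)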
MonthlyCharge:
summary(mydata$MonthlyCharge)
monthlyCharge <- cut(mydata$MonthlyCharge, include.lowest = TRUE, breaks = seq(14, 114, by = 10))
ggplot(mydata, aes(monthlyCharge, ..count.., fill =Churn)) + geom_bar(position="dodge")
ggplot(mydata, aes(x=Churn, y=MonthlyCharge, fill=Churn))+ geom_boxplot()
The churn rate is highest when the monthly bill is between 64 and 74.
OverageFee:
summary(mydata$OverageFee)
overageFee <- cut(mydata$OverageFee, include.lowest = TRUE, breaks = seq(0, 19, by = 1.9))
ggplot(mydata, aes(overageFee, ..count.., fill =Churn)) + geom_bar(position="dodge")
ggplot(mydata, aes(x=Churn, y=OverageFee, fill=Churn))+ geom_boxplot()
RoamMins:
summary(mydata$RoamMins)
roamMins <- cut(mydata$RoamMins, include.lowest = TRUE, breaks = seq(0, 20, by = 2))
ggplot(mydata, aes(roamMins, ..count.., fill =Churn)) + geom_bar(position="dodge")
ggplot(mydata, aes(x=Churn, y=RoamMins, fill=Churn))+ geom_boxplot()
library(corrplot)
# Re-read the raw CSV so all columns stay numeric for the correlation matrix
mydatanumeric = read.csv("GA_Dataset.csv", header = TRUE)
corrplot(cor(mydatanumeric))
Data Usage and Data Plan are highly correlated.
Monthly Charge is also highly correlated with Data Usage, Data Plan and Day Mins.
Churn does not seem to be highly correlated with any of the variables.
Churn has maximum correlation with Contract Renewal, Customer Service Calls and Day
Mins.
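To make this precise, the Churn column of the correlation matrix can be pulled out and sorted by absolute value (sketch; assumes the raw CSV keeps the column name Churn):
# Correlation of Churn with every other variable, strongest first
churn_cor <- cor(mydatanumeric)[, "Churn"]
sort(abs(churn_cor[names(churn_cor) != "Churn"]), decreasing = TRUE)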
3. Building the model on the training dataset and testing its performance on the test dataset.
set.seed(101)
# 70/30 split: sample 70% of the row indices for training, use the rest for testing
sample <- sample.int(n = nrow(mydata), size = floor(.70*nrow(mydata)), replace = F)
train <- mydata[sample, ]
test <- mydata[-sample, ]
The data has been split: the training set has 2,333 observations (70% of the data) and the test set has 1,000 observations.
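A quick sanity check confirms the split sizes and that both partitions retain roughly the original churn proportion (sketch using the train/test objects created above):
# Split sizes and class balance in each partition
dim(train); dim(test)
prop.table(table(train$Churn))
prop.table(table(test$Churn))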
Testing the significance of the Discriminant Function Using MANOVA:
1. The null hypothesis of MANOVA is that the means of the independent variables are equal across the two churn groups, which implies that the independent variables do not differentiate the groups.
2. The alternative hypothesis is that at least one independent variable has different group means, i.e. it is a significant differentiator.
head(train)
Xm=cbind(as.matrix(train[,2:11]))   # predictor matrix (columns 2 to 11)
Ym=as.vector(train[,1])             # group vector (Churn)
Manova=manova(Xm~Ym)
summary(Manova, test = "Wilks")
As we can see, the Wilks' lambda for the MANOVA is close to 1, indicating that the extent of discrimination in the model is relatively low. However, the p-value is highly significant, so the null hypothesis is rejected. This implies that the discriminant model is statistically significant.
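As a rough effect-size reading of the same output, with two groups 1 - Wilks' lambda is the proportion of variance in the discriminant scores explained by group membership (sketch; the 0.83 value is the Wilks' lambda reported for this MANOVA later in the write-up):
wilks_lambda <- 0.83
1 - wilks_lambda   # ~0.17: significant but modest separation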
library(DiscriMiner)   # provides desDA() and linDA()
Fisher=desDA(Xm,Ym)
Fisher
print(Fisher$power,digits = 4)
According to the discrivar values (the discriminant function coefficients), Data Usage has the maximum impact and AccountWeeks the least. The constant and ContractRenewal appear to have a strong negative impact on churn.
Mahalanobis=linDA(Xm,Ym)
Mahalanobis
Score = Mahalanobis$scores
Score
View(Score)
write.csv(Score,file="C:\\Users\\sisaha\\Documents\\Data Science\\Great Lakes\\PGP-BABI\\Predictive Modeling\\Group Assignment\\groupassignmentmahalanobisscores.csv")
Classify records based on discriminant analysis of the X variables and predict the Y variable for the training data.
library(MASS)
X1=as.matrix(mydata[,2:11])
Y1=as.vector(mydata[,1])
Jack=lda(Y1~X1)
Jacknife=table(Actual=Y1,Predicted=predict(Jack)$class)
Jacknife
sublda=lda(Churn~.,data=train)
sublda
The difference in group means is highest for the number of customer service calls and lowest for AccountWeeks. This gives us insight into which factors contribute most to the discrimination between the groups.
The coefficients show a similar pattern and shed light on which X variables contribute most to group separation.
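One way to see both pieces at once is to inspect the fitted lda object directly (sketch; note that the raw LD1 coefficients are scale-dependent, so their magnitudes are only a rough guide unless the predictors are standardised):
sublda$means                                          # per-group means of each predictor
sort(abs(sublda$scaling[, 1]), decreasing = TRUE)     # |LD1 coefficients|, largest first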
plot(sublda, dimen=1, type="b")
The groups created by discriminant analysis can be seen in the graphs and are in line with the Wilks' lambda value of 0.83 from our MANOVA test. These graphs are a good indicator that, although the model is significant, the two groups are not completely separated; there is some overlap.
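The overlap can also be seen by plotting the LD1 scores of the two groups as histograms (sketch; ldahist() ships with MASS, which is already loaded):
lda_scores <- predict(sublda)$x[, 1]   # LD1 score for each training observation
ldahist(data = lda_scores, g = train$Churn)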
library(hmeasure)
lda.pred <- predict(sublda, newdata = test)   # score the hold-out test set
class.lda=lda.pred$class
true.class<-test[,1]
lda.counts <- misclassCounts(class.lda,true.class)
lda.counts$conf.matrix
print(lda.counts$metrics,digits=3)
Evaluate Model Performance Measures:
Errorrate=0.158          # error rate (ER) from the metrics printed above
Accuracyofldamodel = 1-Errorrate
Accuracyofldamodel
Accuracy indicates how many correct predictions the model makes, and this model's accuracy is fairly good. However, since the dataset is unbalanced, accuracy alone is not a sufficient indicator of a robust model, so we will look at a few other performance measures.
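One way to see why: a naive rule that predicts "no churn" for every customer already achieves an accuracy close to the majority-class proportion, so beating it on accuracy alone says little (sketch using the test-set labels defined above):
# Accuracy of always predicting class "0" (no churn)
mean(true.class == "0")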
Sensitivity, also called the true positive rate, is the proportion of actual positives that the model identifies correctly. The sensitivity of this model is 22.5%, which is relatively low. For a churn prediction model it is important to pick up positives as positives, i.e. to accurately identify the customers who will churn, which is exactly what sensitivity measures.
We will now vary the classification threshold from the default 50% to other values to find a better balance between sensitivity and specificity.
lda.pred$posterior[1:3,]
scores.lda <- lda.pred$posterior[,2]   # posterior probability of churn (class "1")
Threshold of 30%
lda.counts.T03 <- misclassCounts(scores.lda>0.3,true.class)
lda.counts.T03$conf.matrix
Threshold of 20%
lda.counts.T02 <- misclassCounts(scores.lda>0.2,true.class)
lda.counts.T02$conf.matrix
Threshold of 17%
lda.counts.T017 <- misclassCounts(scores.lda>0.17,true.class)
lda.counts.T017$conf.matrix
Threshold of 16%
lda.counts.T016 <- misclassCounts(scores.lda>0.16,true.class)
lda.counts.T016$conf.matrix
Threshold of 15%
lda.counts.T015 <- misclassCounts(scores.lda>0.15,true.class)
lda.counts.T015$conf.matrix
Threshold of 10%
lda.counts.T01 <- misclassCounts(scores.lda>0.1,true.class)
lda.counts.T01$conf.matrix
Now, let's compare the error rate, sensitivity, and specificity across the different threshold values.
lda.counts.T02$metrics[c('ER', 'Sens','Spec')]
lda.counts.T017$metrics[c('ER', 'Sens','Spec')]
lda.counts.T016$metrics[c('ER', 'Sens','Spec')]
lda.counts.T015$metrics[c('ER', 'Sens','Spec')]
lda.counts.T01$metrics[c('ER', 'Sens','Spec')]
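Instead of trying thresholds by hand, the full trade-off can be summarised with an ROC curve and a balanced threshold read off via Youden's index (sketch using the pROC package; scores.lda is the churn posterior probability defined above):
library(pROC)
roc_lda <- roc(response = true.class, predictor = scores.lda)
auc(roc_lda)                                                        # area under the ROC curve
coords(roc_lda, "best", ret = c("threshold", "sensitivity", "specificity"))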
Based on the discriminant coefficients and the correlation ratios (cor_ratio) provided by the model, an increase in the variables below increases the probability of customer churn:
These insights from the discriminant model can help the business formulate
strategies to reduce customer churn. Here's what I would recommend to the
business based on what we've learned: First, customer issues should be resolved
within the first or second call, as repeated calls to customer service cause
customer churn. Second, there should be an organized escalation procedure for
issues not resolved within two calls. Lastly, the provider should offer more
attractive plans that reduce the cost of day, evening, and international calls
based on usage.
1. Building the model on the training dataset and testing its performance on the test dataset
names(train)
attach(train)
library(e1071)
library(ROCR)
library(pROC)
SVMModel=svm(Churn~AccountWeeks+ContractRenewal+DataPlan+DataUsage+CustServCalls+DayMins+DayCalls+MonthlyCharge+OverageFee+RoamMins, data=train, kernel="radial", scale=FALSE)
prediction=predict(SVMModel)
prediction
table(Actual=train$Churn,Predicted=prediction)
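Since the stated goal is out-of-sample performance, the same model should also be scored on the hold-out test set (sketch; the table above only measures in-sample fit on the training data):
SVMTestPred = predict(SVMModel, newdata = test)
table(Actual = test$Churn, Predicted = SVMTestPred)
mean(SVMTestPred == test$Churn)   # overall accuracy on the test set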