
UTILIZING LINEAR DISCRIMINANT ANALYSIS, RANDOM FOREST, AND SUPPORT VECTOR MACHINE TO DEVELOP CUSTOMER SATISFACTION MODEL

Haidar Almohri, Wayne State University, Detroit, Michigan, USA
Mohammadjafar Esmaeili, University of Dayton, Dayton, Ohio, USA

ABSTRACT
One of the key indicators of success for customer-centric companies is customer satisfaction, which can relate to the products or services a company offers. Although businesses spend substantial resources collecting raw data on customer behavior, few succeed in converting that data into meaningful results; rigorous and advanced analysis is required to turn the raw data into valuable, actionable results. This study utilizes a historical data set, consisting of survey responses from 76,020 customers, to develop a model that predicts customer satisfaction. Several data classification techniques are applied, and the effectiveness of each resulting model is compared and reported. Despite the incompleteness of the data set (more than 80% of the data is missing), we achieved more than 80% prediction accuracy.

Keywords: Customer Satisfaction, Linear Discriminant Analysis, Random Forest, Support Vector Machine

1. INTRODUCTION
One technique that businesses use to gauge user satisfaction is surveys. Although surveys are a convenient way to learn about customers' expectations, they come with challenges. A main challenge is selecting appropriate methods to overcome the poor quality of the collected data; missing data and erroneous responses are two scenarios that can jeopardize the analysis. For these reasons, this research methodology is organized into two main phases: pre-processing and data analysis. In the pre-processing phase, this research utilizes two main techniques: down sampling and exact matrix completion using soft thresholding (Candès and Recht, 2009). This study then forms three models, utilizing Linear Discriminant Analysis, Random Forest, and soft-margin SVM, to predict customer satisfaction.
2. PRE-PROCESSING OF THE DATA SET
A well-known bank provided the data set, which consists of survey responses from 76,020 customers. As is usually the case with survey-based data, this data set is incomplete: preliminary analysis of the historical data reveals that around 90% of the surveys have missing entries. Such a large proportion of missing data magnifies the impact of the imputation method. Initially, 336 columns are used as predictors for our predictive model. The single response is the outcome of whether a customer is satisfied (0) or unsatisfied (1). To overcome the challenge of missing entries, this research takes the following steps to prepare the data prior to building the models:
2.1 Data Imputation and Down sampling
Imputation is the systematic process of replacing missing data with substituted values. A large number of studies have focused on developing mathematical algorithms to perform imputation as a pre-processing step for incomplete data. One well-known and validated technique in this field is exact matrix completion through soft thresholding. Candès and Recht (2009) proved that a low-rank n×n matrix with m observed entries can be fully recovered, with high probability, by solving a convex optimization problem if inequality (2.1) holds:
    m ≥ C n^1.2 r log(n)        (2.1)
where C is a constant and r is the rank of the matrix to be recovered (Candès and Recht, 2009). The algorithm states that for a matrix M, if the set of observed entries is denoted by Ω = {(i, j) : M_ij is observed}, then M is recovered by solving the convex optimization problem:

    minimize    ||X||_*
    subject to  X_ij = M_ij for all (i, j) ∈ Ω        (2.2)

where X is the recovered matrix and ||X||_* is the nuclear norm of X (the sum of its singular values).
We applied the above algorithm using the softImpute package in R; the entire code, along with the rest of the analysis, is accessible through GitHub (Almohri, 2016).
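As an illustration, a minimal sketch of this imputation step with the softImpute package is shown below; the predictor matrix name X and the rank.max and lambda settings are placeholders, not the values used in the original analysis.

    library(softImpute)

    # X: numeric predictor matrix with missing entries (placeholder name).
    # Fit a low-rank approximation via soft-thresholded SVD and then
    # fill in the unobserved entries of the matrix.
    fit <- softImpute(X, rank.max = 30, lambda = 1, type = "als")
    X_complete <- complete(X, fit)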
We then removed the columns with near-zero variance, as well as columns with a pairwise correlation greater than 0.8, to mitigate multicollinearity in our model.
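A sketch of this filtering step using the caret package follows; the 0.8 cutoff matches the text, while the object names continue the earlier placeholders.

    library(caret)

    # Drop near-zero-variance columns (guard against an empty index vector)
    nzv <- nearZeroVar(X_complete)
    X_filtered <- if (length(nzv) > 0) X_complete[, -nzv] else X_complete

    # Drop one column from each pair with pairwise correlation above 0.8
    high_cor <- findCorrelation(cor(X_filtered), cutoff = 0.8)
    if (length(high_cor) > 0) X_filtered <- X_filtered[, -high_cor]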
In addition, a number of outliers in each predictor are removed during pre-processing. An observation is marked as an outlier if it falls more than three standard deviations from the mean of its corresponding predictor. That is, an observation x_i^(p) (the ith observation of the pth predictor) is retained only if

    |x_i^(p) − X̅^(p)| ≤ 3 σ_X^(p)        (2.3)

where X̅^(p) and σ_X^(p) are the mean and standard deviation of the pth predictor.
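A sketch of this three-standard-deviation rule, applied column-wise to the filtered predictors, might look as follows (object names continue the earlier placeholders):

    # Keep only rows whose values lie within mean +/- 3 sd for every predictor
    keep <- rep(TRUE, nrow(X_filtered))
    for (p in seq_len(ncol(X_filtered))) {
      x <- X_filtered[, p]
      keep <- keep & (abs(x - mean(x)) <= 3 * sd(x))
    }
    X_clean <- X_filtered[keep, ]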
After pre-processing, 119 columns remain as predictors and are fed to our predictive model.
One challenge encountered in this work is that the labels in the data set are imbalanced, i.e., one class greatly outnumbers the other. In particular, about 96% of the observations belong to class 0 (satisfied customers), while the rest belong to class 1 (unsatisfied customers). Imbalanced data sets arise in many real-world problems, including fraud detection, network intrusion, and rare disease diagnosis (Chen, Liaw, and Breiman, 2004). The main problem in modeling such a population is that general classifiers aim to minimize the overall classification error rate, whereas in these problems special attention must be given to the rare class. As a result, the misclassification error rate can be misleading: if the model classifies all observations as 0, the error rate is only 4%, yet the false negative rate is 100%. With the increasing availability of data, this problem has received growing attention in recent years (Chen, Liaw, and Breiman, 2004; Chawla, 2009; Provost, 2000).
The other technique used during the training process is down sampling: reducing the number of observations in the over-represented class so that the two classes are balanced, e.g., 50/50. Although this technique has the drawback of discarding a fair amount of data, it has proved to be a very effective strategy in practice. Hence, the data set is split into training (80%) and testing (20%) sets, with down sampling applied to balance the classes.
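A sketch of the split and balancing with caret is shown below; y denotes the 0/1 satisfaction labels aligned with the rows of X_clean, and the seed is arbitrary.

    library(caret)
    set.seed(2016)

    # Stratified 80/20 split
    idx     <- createDataPartition(y, p = 0.8, list = FALSE)
    train_x <- X_clean[idx, ];  train_y <- y[idx]
    test_x  <- X_clean[-idx, ]; test_y  <- y[-idx]

    # Balance the training classes by down sampling the majority class;
    # downSample returns a data frame with the outcome in a column "Class"
    balanced <- downSample(x = as.data.frame(train_x), y = factor(train_y))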
3. PREDICTION MODELING
Three different classification algorithms are applied to the data set to find the model that produces the best results. In this section, brief overviews of the models and preliminary results are presented; a more in-depth discussion of the results and a comparison between the models follow in the next section.
3.1 Linear Discriminant Analysis

Given data X ∈ R^p and labels Y, Linear Discriminant Analysis (LDA) belongs to the family of so-called "generative models", which model the posterior distribution P(Y|X) by estimating the joint distribution P(X, Y) (in contrast to "discriminative models", such as logistic regression, which model the posterior P(Y|X) directly). In this case, the prior probability of Y ∈ {0, 1} is given by

    Y ~ Bernoulli(π)

where π = P(Y = 1), or simply the proportion of class 1 in the training data. The conditional distribution of X given Y is assumed to be Gaussian:

    P(X|Y) ~ N(μ, Σ)

where μ = (μ_1, μ_2, …, μ_p) is the vector of means and Σ = diag(σ_1², σ_2², …, σ_p²). Note that the two classes are assumed to share the same diagonal covariance matrix (with different means).
The posterior probability is then calculated by

    P(Y = 1 | x) = 1 / (1 + e^(−β^T x − α))        (3.1)

where

    β = Σ^(−1) (μ_1 − μ_0),
    α = −(1/2) (μ_1 + μ_0)^T Σ^(−1) (μ_1 − μ_0) + log(π / (1 − π))        (3.2)

As equation (3.2) shows, the posterior probability is influenced by the choice of the prior π. Using this fact, especially in the case of imbalanced data, one can tune the prior π to optimize performance on the testing data; the optimal prior probabilities are selected by cross-validation. Table 1 shows the confusion matrix for this model on the testing data set.
                      Actual
                    0        1
    Predicted  0   72%      27%
               1   28%      73%

Table 1: Confusion matrix for LDA
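For concreteness, a sketch of the LDA fit with a tuned prior using MASS::lda is shown below; the prior values are illustrative stand-ins for the cross-validated ones, and the data objects continue the earlier placeholders.

    library(MASS)

    # Fit LDA on the balanced training data with a hand-set prior
    fit_lda <- lda(Class ~ ., data = balanced, prior = c(0.6, 0.4))

    # Predict on the held-out test set and tabulate the confusion matrix
    pred_lda <- predict(fit_lda, newdata = as.data.frame(test_x))
    table(Predicted = pred_lda$class, Actual = test_y)
    p1_lda <- pred_lda$posterior[, "1"]   # class-1 probabilities for ROC/Brier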

3.2 Random Forest


Random forest is a bagging technique that has proved extremely powerful in both regression and classification. It repeatedly draws random subsets of the data to grow n decision trees. A new data point is then classified by each of the n trees, producing n results (votes); in classification problems, the predicted class is the one that receives the most votes. The result of applying random forest to our data set is shown in Table 2.
                      Actual
                    0        1
    Predicted  0   80%      28%
               1   20%      72%

Table 2: Confusion matrix for Random Forest
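A sketch of the random forest fit with the randomForest package follows; ntree = 500 is the package default and an assumption here, not a reported setting.

    library(randomForest)

    # Grow the forest on the balanced training data
    fit_rf <- randomForest(Class ~ ., data = balanced, ntree = 500)

    # Majority-vote predictions and class-1 probabilities on the test set
    pred_rf <- predict(fit_rf, newdata = as.data.frame(test_x))
    p1_rf   <- predict(fit_rf, newdata = as.data.frame(test_x), type = "prob")[, "1"]
    table(Predicted = pred_rf, Actual = test_y)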


3.3 Soft-margin SVM
Support Vector Machine (SVM) is a classification algorithm obtained purely by solving an optimization problem. Assuming class labels Y ∈ {−1, 1}, SVM finds the hyperplane that best separates the two classes by maximizing the margin (the distance between the two classes). This is called hard-margin SVM, which forces a rigid margin, i.e., a perfect separation with zero error. The method works well for linearly separable data, but almost no real-world classification problem is linearly separable. Hence, allowing some data points to cross the separating hyperplane (some error) helps build a more general model that fits the testing data more robustly. The soft-margin SVM achieves this by introducing slack variables, which measure how much violation of the margin is allowed; by minimizing them, we separate the data as cleanly as possible. If we denote the data points by (x_i, y_i), i = 1, …, n, the weight vector of the separating hyperplane by w, and the slack variables by φ_i, then equation (3.3) gives the formulation of the soft-margin SVM.
    minimize    (1/2) ||w||² + C Σ_{i=1..n} φ_i
    subject to  y_i w^T x_i ≥ 1 − φ_i,   φ_i ≥ 0        (3.3)
where the constant C is the penalty term (a.k.a. the regularization parameter), which determines how heavily each violation is penalized when training the model.
Moreover, one can utilize the so-called "kernel trick" with SVM. Applying a kernel is an extremely powerful technique that can make non-separable data separable in a different (higher-dimensional) space. The idea is to project the columns (features) of the data into a higher dimension using a mapping function Θ(·), which induces the corresponding kernel function K (Jakkula, 2006; Yu and Kim, 2012).
A soft-margin SVM using two different kernels (Radial Basis Function and second-order polynomial) is fit to our data set. The optimal parameters, including the regularizer C, are found using cross-validation. Table 3 summarizes the results of the two models.

    (a) RBF kernel                       (b) Polynomial kernel

                      Actual                               Actual
                    0        1                           0        1
    Predicted  0   77%      36%          Predicted  0   80%      28%
               1   23%      64%                     1   20%      72%

Table 3: Confusion matrix for soft-margin SVM using: (a) RBF kernel (b) Polynomial kernel
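A sketch of the two soft-margin SVM fits using the e1071 package is shown below; the cost values are placeholders for the cross-validated regularizer C.

    library(e1071)

    # RBF-kernel soft-margin SVM
    fit_rbf <- svm(Class ~ ., data = balanced, kernel = "radial",
                   cost = 1, probability = TRUE)

    # Second-order polynomial-kernel soft-margin SVM
    fit_poly <- svm(Class ~ ., data = balanced, kernel = "polynomial",
                    degree = 2, cost = 1, probability = TRUE)

    table(Predicted = predict(fit_rbf, as.data.frame(test_x)), Actual = test_y)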
4. ANALYSIS OF THE RESULTS
Table 4 provides detailed information about the four classifiers described in the previous section.

    Classifier          Accuracy (95% CI)   Sensitivity (%)   Specificity (%)   Brier Score
    LDA                 (82.2, 83.4)        83.7              62.2              0.17
    SVM (RBF Kernel)    (76.2, 77.6)        77.5              63.7              0.18
    SVM (Polynomial)    (71.4, 72.8)        72.2              68.2              0.18
    Random Forest       (79.7, 80.9)        80.7              71.8              0.16
Table 4: Detailed results for the classifiers
One of the most widely used methods for inspecting the performance of a binary classifier is the Receiver Operating Characteristic (ROC) curve, which compares the rates at which the classifier makes true positive and false positive predictions. In particular, the area under the curve (AUC) is used to compare the classifiers. Figure 1 shows the ROC curves, and Table 5 lists the AUC obtained for each classifier.
Figure 1: ROC curve for the classifiers

Classifier AUC
LDA 0.72
SVM (RBF Kernel) 0.79
SVM (Polynomial) 0.77
Random Forest 0.77
Table 5: AUC for the classifiers
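As an illustration, the ROC curve and AUC for one model can be computed with the pROC package, assuming p1_lda holds the class-1 posterior probabilities from the earlier LDA sketch.

    library(pROC)

    roc_lda <- roc(response = test_y, predictor = p1_lda)
    plot(roc_lda)   # ROC curve
    auc(roc_lda)    # area under the curve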
5. DISCUSSION
Assessing accuracy and comparing classifiers is a widely studied problem (Labatut and Cherifi, 2012). Simply counting correctly predicted values does not provide a strong basis for assessing a classifier's performance, especially in the case of heavily unbalanced data. Although sensitivity, specificity, and AUC provide good insight into classifier performance, these measures fail to account for the asserted class probabilities or rankings. Various proper scoring rules measure accuracy based on class probabilities; among them, the Brier score is widely used. The Brier score is a simple quadratic scoring function that calculates the mean squared error of a probability forecast. The most common formula for the Brier score is:
    BS = (1/n) Σ_{i=1..n} (p_i − o_i)²        (5.1)

where p_i is the probability the model produces for observation i and o_i is the observed label. Accordingly, a lower Brier score indicates better performance. Combining hard prediction results with a proper scoring rule such as the Brier score provides more insight when assessing classifiers.
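Equation (5.1) is straightforward to compute; a minimal sketch using the class-1 probabilities from the earlier LDA sketch:

    # Observed 0/1 labels and predicted class-1 probabilities
    o     <- as.numeric(as.character(test_y))
    brier <- mean((p1_lda - o)^2)   # lower is better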
Based on the results in the previous section, LDA produced the highest overall accuracy (95% confidence interval of (82.2, 83.4)). However, most of its gain comes from predicting the positive class, which in this study represents the satisfied customers. It has the lowest accuracy in predicting the negative class (unsatisfied customers) as well as the lowest AUC, and the second-lowest Brier score.
Both SVM models (RBF and second-order polynomial kernels) produced similar results, the difference being that the RBF kernel yields higher sensitivity (77.5%), whereas the second-order polynomial yields higher specificity (68.2%). SVM with the RBF kernel produced the highest AUC; both SVMs had the highest Brier score (0.18).
Random forest performed well on both sensitivity and specificity (second-highest sensitivity and highest specificity), providing a good balance between the two prediction classes. In general, random forest is a very strong tool when only prediction, not explanation, of the result is required. Random forest also had the lowest Brier score.
6. CONCLUSION
Four different models for predicting customer satisfaction from a highly incomplete and imbalanced data set are presented. Allowing for some variation, all models performed quite well on the testing data. Since each classifier performed differently on the positive and negative classes, the best way to select the final classifier is to assign a cost to each type of misclassification. For example, if classifying a satisfied customer as unsatisfied proves more costly than classifying an unsatisfied customer as satisfied, one would select LDA, which has the highest sensitivity; otherwise, random forest can be selected.
The ROC curve gives business operators a baseline for finding the tradeoff between sensitivity and specificity at different cutoffs. The proper cutoff point can be chosen according to the needs of the business (the costs of type I and type II errors).
References
Almohri, H., 2016. Utilizing R programming to predict customers' satisfaction utilizing advanced techniques. GitHub repository: https://gist.github.com/mesmaeili1/48cb2aa5eb91b6704aac7b30e4bff831.
Candès, E.J. and Recht, B., 2009. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), pp. 717-772.
Chawla, N.V., 2009. Data mining for imbalanced data sets: An overview. In: Data Mining and Knowledge Discovery Handbook (pp. 875-886). Springer US.
Chen, C., Liaw, A. and Breiman, L., 2004. Using random forest to learn imbalanced data. University of California, Berkeley.
Jakkula, V., 2006. Tutorial on support vector machine (SVM). School of EECS, Washington State University.
Labatut, V. and Cherifi, H., 2012. Accuracy measures for the comparison of classifiers. arXiv preprint arXiv:1207.3790.
Provost, F., 2000. Machine learning from imbalanced data sets 101. In: Proceedings of the AAAI 2000 Workshop on Imbalanced Data Sets (pp. 1-3).
Yu, H. and Kim, S., 2012. SVM tutorial: Classification, regression and ranking. In: Handbook of Natural Computing (pp. 479-506). Springer Berlin Heidelberg.
