PM Project Logistic Regression LDA.docx
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some didn't. You have to help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also, find out the important
factors on the basis of which the company will focus on particular employees to sell their packages.
Attributes:
Q2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.
Ans 2.1:
The dataset relates to various attributes of 872 employees of a company along with
information on which employee bought or did not buy the Holiday package from the tour
company
The data has a column called “Unnamed: 0”, which appears to be a serial number and is unlikely to be relevant for the analysis. Hence, this column will be dropped from further analysis
Other than that, the data has 7 variables:
o Holiday Package and Foreign seem to be categorical variables
o Salary, age, education, no of young children, no of older children appear to be numerical variables.
Holiday Package is the target variable and rest all are the independent variables
We can confirm the datatypes (as mentioned above) of the variables in the dataset using the “info”
function (from the pandas package in Python). The function also shows whether there
are any null values in any of the features, as well as the total number of rows and columns in the data
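A minimal sketch of this ingestion step is shown below. The tiny synthetic frame and its column names (e.g. "Holliday_Package", "no_young_children") are assumptions standing in for the actual dataset, which would normally be loaded with pd.read_csv:

```python
import pandas as pd

# Synthetic frame mirroring the report's schema; in the real analysis
# this would be df = pd.read_csv("<dataset file>") instead.
df = pd.DataFrame({
    "Holliday_Package": ["yes", "no", "no"],   # target variable (assumed name)
    "Salary": [48412.0, 37207.0, 58022.0],
    "age": [30, 45, 46],
    "educ": [8, 8, 9],
    "no_young_children": [1, 0, 0],
    "no_older_children": [1, 1, 0],
    "foreign": ["no", "no", "no"],
})

df.info()                     # dtypes, non-null counts, row/column totals
print(df.describe())          # descriptive statistics for numerical columns
print(df.isnull().sum())      # null-value condition check, per column
```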
The mean of the “Salary” and “No of young children” variables is greater than the median, indicating
right skewness in the data. We will examine the magnitude of the skewness in the univariate analysis
section
For all the other numerical variables (age, education, no of older children), the mean and median
values are roughly equal, indicating that the data could be normally distributed. We will
confirm this through univariate analysis
The coefficient of variation of Salary, Age and Education is less than 1, indicating that the data is
concentrated around the mean with little skewness
The coefficient of variation of no. of young children and no. of older children is more than 1, indicating
some skewness in the data
There are 2 sub-classes in Holiday Package and Foreign features. More information on the
distribution/frequency of sub-classes within each variable will be covered in the univariate
analysis section.
Mean/Median of no. of older children is 1
The minimum value in the Salary feature is 1322, which seems very low. Based on the
table below, only one row has this value, and it appears to be an anomaly given the
age and number of education years of the employee. Hence, we will exclude this row
from the analysis:
Due to the exclusion of the above-mentioned row, the total number of rows is now 871
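The exclusion of the anomalous row can be sketched as below; the three-row frame is an illustrative stand-in for the actual data, in which only one row carries the Salary value 1322:

```python
import pandas as pd

# Synthetic stand-in: one row with the anomalous minimum Salary of 1322.
df = pd.DataFrame({"Salary": [1322.0, 48412.0, 37207.0],
                   "age": [20, 30, 45]})

# Drop the anomalous row and re-index the remaining rows.
df = df.loc[df["Salary"] != 1322].reset_index(drop=True)
print(len(df))  # one fewer row than before the exclusion
```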
400 employees opted for Holiday package, while 471 did not opt for it
216 employees are foreigners and 656 employees are locals
Other than that, the data has no missing values or anomalies
Univariate Analysis (Skewness Score, Histogram and Boxplot of continuous variables)
Salary
Age
Education
Salary data is right skewed and has quite a few outliers on the higher side. These
outliers look genuine, as this kind of dispersion in salary data is normal in a company: the
majority of employees fall within a certain salary range, while employees at higher
levels (Manager and above) are paid relatively much more. Hence, we will not be treating
these outliers
Age data looks normally distributed with no outliers
Education data also looks fairly normally distributed with very few outliers, and the outlier
values look plausible. Hence, we will not be treating the outliers in this case either
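The univariate checks above (skewness score, histogram, boxplot per continuous variable) can be sketched as follows. The toy values are assumptions chosen only to show the mechanics, not the report's actual figures:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

# Toy data: a right-skewed "Salary" and a roughly symmetric "age".
df = pd.DataFrame({"Salary": [30000, 32000, 35000, 40000, 150000],
                   "age": [25, 32, 40, 48, 55]})

for col in ["Salary", "age"]:
    print(col, "skewness:", round(df[col].skew(), 2))
    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(df[col]); ax1.set_title(f"{col} histogram")
    ax2.boxplot(df[col]); ax2.set_title(f"{col} boxplot")
    fig.savefig(f"{col}_univariate.png")
    plt.close(fig)
```

A positive skewness score corresponds to the right tail seen in the Salary histogram.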
The charts below show the count of observations under each sub-class of the categorical variables.
Observations:
1) Majority of the employees have 0 children, both in the “no of young children” and “no of
older children” features
2) In the “no of older children” feature, among employees that have children, most have either 1 or
2 children. Very few employees have 3 or more children
3) Majority of the employees are locals, while there are a few foreigners
Bivariate Analysis
Boxplots of Salary, Age and Education with split by “Holiday Package” buyers
Observations:
1) There does not seem to be significant difference in the median Salary, age and
education of employee groups who bought the package vs those who did not
2) In all the three cases, the median Salary, Age and Education of employees who did not
buy the package is marginally higher than that of employees who bought the package
Countplot of no. of young children, no. of older children and Foreign with split by “Holiday
Package” buyers
Observations:
1) Number of employees buying the package is slightly more for two groups:
2) For all the other sub-classes within the “no of young children” and “no of
older children” features, relatively more employees did NOT buy the package
3) Amongst local employees, package buyers are fewer, while amongst foreigners
the number of buyers is relatively higher
Pair plot
Before splitting the data and fitting the models, the categorical variables “Holliday
Package” and “Foreign” need to be converted into numerical values, since models like Logistic
Regression and Linear Discriminant Analysis can only take numerical inputs.
Upon conversion of categorical variables into numerical variables, the dataframe looks as below:
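The conversion can be sketched as below; the column names and the yes/no coding are assumptions mirroring the report's two categorical variables:

```python
import pandas as pd

# Two-column toy frame for the categorical variables (assumed names/values).
df = pd.DataFrame({"Holliday_Package": ["yes", "no", "yes"],
                   "foreign": ["no", "yes", "no"]})

# Map yes/no to 1/0 so the models receive numerical inputs.
for col in ["Holliday_Package", "foreign"]:
    df[col] = df[col].map({"no": 0, "yes": 1}).astype(int)

print(df.dtypes)  # both columns are now integer-typed
```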
The next step is to extract the target column into a separate vector for the training as well as the test
set. Post the extraction step, independent variables are stored in the ‘X’ dataframe and the Target
column is stored in the ‘y’ dataframe
Both X (independent variables) and y (target variable) datasets will now be split into Test and Train
data (using the “train_test_split” function from the “sklearn” package in python) in a 70:30 proportion,
meaning the training dataset will have 70% data from the full dataset and the test data will have the
remaining 30% data. Post executing the above function, we get the following test and training sets
for X and y datasets
Training Data
Test Data
The distribution of target variable in classes 0 and 1 in the training and test data is consistent and it
is also in the same proportion as was in the full dataset before the split
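The 70:30 split described above can be sketched as follows. The toy X and y, and the random_state value, are illustrative assumptions; stratifying on y is what keeps the class proportions consistent across the train and test sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the report's X (independent variables) and y (target).
X = pd.DataFrame({"Salary": range(10), "age": range(30, 40)})
y = pd.Series([0, 1] * 5, name="Holliday_Package")

# 70:30 split; stratify=y preserves the 0/1 proportion in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)
```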
Logistic Regression is a statistical approach for estimating the probability of each
target label. In its basic form it is used to classify binary data. Logistic regression is
similar to linear regression in that the explanatory variables (X) are combined with weights to predict
a binary-class target variable (y). The main difference between linear regression and logistic
regression is the type of the target variable, which in the case of the latter is categorical
GridSearch function has been used to select the final tuning parameters from a range of values for
each parameter. Multiple values for each parameter were tested before arriving at the final best
parameter as shown below:
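A sketch of this grid search is shown below. The synthetic data, the particular parameter grid, the cv fold count and the scoring metric are all assumptions for illustration; the report's actual grid is not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data standing in for the employee dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# Illustrative grid of candidate tuning parameters.
grid = {"C": [0.1, 1.0, 10.0], "solver": ["lbfgs", "liblinear"]}

search = GridSearchCV(LogisticRegression(max_iter=1000), grid,
                      cv=3, scoring="f1")
search.fit(X, y)

print(search.best_params_)          # the selected "best" parameters
best_model = search.best_estimator_ # refit on the full data with them
```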
Building the Linear Discriminant Analysis (LDA) model
LDA and logistic regression are both multivariate statistical methods used to determine
relationships between independent variables and a categorical dependent variable.
While Logistic Regression has been explained above, in LDA, orthogonal (mutually
perpendicular) discriminant functions are estimated such that the difference of means between
the existing groups (class labels) is maximized while the variation within the groups is minimized. Thus,
the predicted class for a data point is the one with the highest value of its corresponding
linear discriminant function.
In Python, an LDA model can be fit using the LinearDiscriminantAnalysis() function from the scikit-learn
package, as below:
Importing the package and splitting the data into Test and Train sets in a 70:30 ratio
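A minimal sketch of the LDA fit, using synthetic data in place of the employee dataset (the data and random_state are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, split 70:30 as in the report.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Fit LDA with default parameters and score it on the held-out set.
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print(lda.score(X_test, y_test))  # test-set accuracy
```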
Q2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
Final Model: Compare Both the models and write inference which model is best/optimized.
Once the models are fit, we will look at a few performance measures for both models to see
how good or bad they are at making predictions. Key measures and terminologies used in
evaluating model performance include:
True Positives (TP): Predicted by the model as Yes and actually also Yes
The above four metrics are included in Confusion Matrix, from which the following ratios are
calculated to assess the accuracy of Predictions. The output of the below ratios can be seen in the
Classification report and the ROC plot. Key evaluation measures are:
Sensitivity or Recall: When it's actually yes, how often does the model predict yes. Calculated as: TP/(TP+FN)
Precision: Among the points identified as Positive by the model, how many are really Positive
TP/(TP+FP)
Specificity: How many of the actual Negative data points are identified as negative by the model. Calculated as:
TN/(TN+FP)
ROC Curve: This is a commonly used graph that summarizes the performance of a classifier over all
possible thresholds. It is generated by plotting the True Positive Rate (TP/Total Actual Positives) on
the y-axis against the False Positive Rate (FP/Total Actual Negatives) on the x-axis
AUC: AUC is an abbreviation for the area under the ROC curve. The closer the AUC for a model
comes to 1, the better it is. So models with higher AUCs are preferred over those with lower AUCs.
Now we will evaluate the performance of each model using the above measures
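The evaluation loop over the train and test sets can be sketched as below. The synthetic data and the fitted Logistic Regression are illustrative assumptions; the same calls apply unchanged to the LDA model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a fitted model to evaluate.
X, y = make_classification(n_samples=300, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, Xs, ys in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = model.predict(Xs)                 # hard class labels
    prob = model.predict_proba(Xs)[:, 1]     # probabilities for the ROC/AUC
    print(name, "accuracy:", round(accuracy_score(ys, pred), 2))
    print(confusion_matrix(ys, pred))        # TP/FP/FN/TN counts
    print(classification_report(ys, pred))   # precision, recall, F1
    print(name, "AUC:", round(roc_auc_score(ys, prob), 2))
    fpr, tpr, _ = roc_curve(ys, prob)        # points for plotting the ROC curve
```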
Training Data:
Test Data:
Training Data:
Test Data:
AUC Score and ROC plots
Training Data:
Test Data:
Train Data:
AUC: 0.73
Accuracy: 0.66
F1 Score: 0.59
Precision: 0.65
Test Data:
AUC: 0.73
Accuracy: 0.68
F1 Score: 0.62
Precision: 0.67
Observations
Training Data:
Test Data:
Training Data:
Test Data:
Training Data:
Test Data:
Overall Summary of the Key Measures of LDA Model
Train Data:
AUC: 0.73
Accuracy: 0.65
F1 Score: 0.59
Precision: 0.64
Test Data:
AUC: 0.73
Accuracy: 0.66
F1 Score: 0.60
Precision: 0.66
Observation:
Upon comparing the key metrics across the two models, Logistic Regression appears marginally
better than the LDA model, as its Accuracy, F1 and Precision scores are slightly higher.
AUC scores are the same across the two models
Q2.4 Inference: Basis on these predictions, what are the business insights and
recommendations.
Ans 2.4
Based on the above analysis, the results from the Logistic Regression model are slightly better than
those from LDA. Further, the model is fairly consistent in its predictions, with about 65% accuracy. Hence,
the company can implement the Logistic Regression model to start with.
While the model's current performance is not the best and there is definitely scope for further
improvement, it is not very poor either: it can still be used to predict, with about 65% accuracy,
whether an employee will purchase the Holiday package. Further, the company can work on improving
the model's performance over time as more data becomes available.