PM Project Logistic Regression LDA.docx

The document outlines a project involving data analysis and modeling to predict employee interest in holiday packages based on a dataset of 872 employees. It details the steps taken, including data ingestion, exploratory data analysis, and the application of Logistic Regression and Linear Discriminant Analysis (LDA) for modeling. The evaluation of model performance indicates that Logistic Regression outperforms LDA slightly, with both models achieving around 65-68% accuracy, leading to business insights and recommendations for the tour and travel agency.

Uploaded by

subhadeepseal1
Copyright
© © All Rights Reserved
Table of Contents

1) Problem Statement
2) Q1: Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
3) A1: Dataset Sample, Info, Descriptive Statistics and other EDA
4) Q2: Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
5) A2: Response to Question 2
6) Q3: Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare both the models and write inference which model is best/optimized.
7) A3: Model performance measures and model comparison
8) Q4: Inference: Basis on these predictions, what are the business insights and recommendations.
9) A4: Final insights and business recommendations

Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some didn't. You have to help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also, find out the important
factors on the basis of which the company will focus on particular employees to sell their packages.

Attributes:

Q2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.

Ans 2.1:

Sample of the dataset:

 The dataset relates to various attributes of 872 employees of a company along with
information on which employee bought or did not buy the Holiday package from the tour
company
 The data has a column called “Unnamed: 0” which seems like S.No and is probably not of
any relevance for analysis. Hence, this column will be dropped from analysis in future tables
 Other than that, the data has 7 variables:
o Holiday Package and Foreign seem to be categorical variables
o Salary, age, education, no of young children, no of older children appear to be numerical variables.
 Holiday Package is the target variable and rest all are the independent variables
We can confirm the datatypes (as mentioned above) of the variables in the dataset using the “info”
function (from the pandas package in Python). The function also shows whether there are any null
values in any of the features, as well as the total number of rows and columns in the data

The above table shows:

 The datatype of the variables is same as mentioned in the previous section


o Holiday Package and Foreign are categorical variables
o Salary, age, education, no of young children, no of older children are numerical variables

 None of the features have missing values


 The data has 872 rows and 7 columns
 The data has also been checked for duplicate values and there are no duplicate values in
the data
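
The checks described above can be sketched roughly as follows. The small DataFrame stands in for the real file, and the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Tiny stand-in for the real dataset; column names and values are assumptions.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "Holiday_Package": ["no", "yes", "no", "yes"],
    "Salary": [48412, 37207, 58022, 66503],
    "age": [30, 45, 46, 31],
    "educ": [8, 8, 9, 11],
    "no_young_children": [1, 0, 0, 2],
    "no_older_children": [1, 1, 0, 0],
    "foreign": ["no", "no", "no", "yes"],
})

# Drop the serial-number column, which carries no information for modelling.
df = df.drop(columns=["Unnamed: 0"])

df.info()                                   # dtypes and non-null counts per column
null_counts = df.isnull().sum()             # missing values per column
n_duplicates = int(df.duplicated().sum())   # number of fully duplicated rows
print(null_counts)
print("duplicates:", n_duplicates)
```

On the real data, the same three calls would be expected to report 7 columns, no nulls and no duplicates, in line with the observations above.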

Using methods of descriptive statistics to describe the data

Coefficient of Variation of numerical variables:


Observations from descriptive statistics:

 Mean of “Salary” and “No of young children” variables is more than the median, indicating
right skewness in the data. We’ll see the magnitude of skewness in the univariate analysis
section

 For all the other numerical variables (age, education, no of older children), mean and median
values are more or less same indicating that the data could be normally distributed. We will
confirm this through univariate analysis

 Coefficient of Variation (standard deviation divided by mean) of Salary, Age and Education is
less than 1, indicating that the values are spread fairly tightly around the mean

 Coefficient of Variation of # of young children and # of older children is more than 1, indicating
high dispersion relative to the mean and possibly some skewness in the data

 There are 2 sub-classes in Holiday Package and Foreign features. More information on the
distribution/frequency of sub-classes within each variable will be covered in the univariate
analysis section.

 Mean/Median of:

o age of the employees is 39 years

o # of young children is 0, and

o # of older children is 1

 Minimum value in Salary feature is 1322. This value seems to be very low. Based on the
table below, there is only one row which has this value and seems like an anomaly given the
age and the number of education years of the employee. Hence, we will exclude this row
from the analysis:
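
As a minimal sketch, the coefficient of variation and the exclusion of the minimum-salary row could be computed as below. Apart from the reported minimum salary of 1322, the values are made up, and the column names are assumptions.

```python
import pandas as pd

# Illustrative values only; the column names are assumptions.
df = pd.DataFrame({
    "Salary": [48412, 1322, 58022, 66503, 37207],
    "age": [30, 45, 46, 31, 44],
    "no_young_children": [1, 0, 0, 2, 0],
})

summary = df.describe().T          # count, mean, std, min, quartiles, max
cv = df.std() / df.mean()          # coefficient of variation per column

# Locate the row holding the minimum salary and exclude it.
anomaly_idx = df["Salary"].idxmin()
df_clean = df.drop(index=anomaly_idx).reset_index(drop=True)

print(cv.round(2))
print(len(df), "->", len(df_clean))
```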

Checking the unique value of categorical variables

 Due to the exclusion of the above-mentioned row, the total number of rows is now 871
 400 employees opted for Holiday package, while 471 did not opt for it
 216 employees are foreigners and 656 employees are locals
 The data has no missing values or anomalies
Univariate Analysis (Skewness Score, Histogram and Boxplot of continuous variables)

Salary

Age
Education

Univariate analysis of numerical variables shows that:

 Salary data is right skewed and has quite a few outliers on the higher side. But, these
outliers look genuine as this kind of dispersion in salary data in a company is normal where
majority of employees fall within a certain salary range, but certain employees at higher
levels (Manager and above) are paid relatively much higher. Hence, we will not be treating
these outliers
 Age data looks to be normally distributed with no outliers
 Education data also looks fairly normally distributed with very few outliers, but the outlier
values look plausible. Hence, we will not be treating the outliers in this case also
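
The skewness score and the boxplot outlier rule used in this section can be illustrated with a short sketch. The salary values are invented for the example, and the 1.5 × IQR rule is the usual boxplot convention, assumed here rather than taken from the report.

```python
import pandas as pd

# Illustrative salaries with one large value; not the project's data.
salary = pd.Series(
    [30000, 32000, 35000, 36000, 38000, 40000, 41000, 150000], name="Salary"
)

skewness = salary.skew()           # a positive value means a longer right tail

# Boxplot rule: points beyond 1.5 * IQR from the quartiles count as outliers.
q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = salary[(salary < q1 - 1.5 * iqr) | (salary > q3 + 1.5 * iqr)]

print(round(skewness, 2), outliers.tolist())
```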

Countplot of the remaining features

Below charts show the count of observations under each sub class of the categorical dimensions.
Observations:

1) Majority of the employees have 0 children, both in the “no of young children” and “no of
older children” features

2) In the “no of older children” feature, among employees who have children, most have
either 1 or 2. There are very few employees with 3 or more children

3) Majority of the employees are locals, while there are a few foreigners
Bivariate Analysis

Boxplots of Salary, Age and Education with split by “Holiday Package” buyers

Observations:

1) There does not seem to be significant difference in the median Salary, age and
education of employee groups who bought the package vs those who did not

2) In all the three cases, the median Salary, Age and Education of employees who did not
buy the package is marginally higher than that of employees who bought the package
Countplot of no. of young children, no. of older children and Foreign with split by “Holiday
Package” buyers

Observations:

1) Number of employees buying the package is slightly more for two groups:

a. Those with 0 # of young children

b. Those with 2, 3 or 6 older children

2) For all the other sub-classes within both the features “no of young children” and “no of
older children”, the number of employees NOT buying the package is relatively more

3) Amongst local employees, buyers of package are less, while the number of buyers
amongst foreigners is more
Pair plot

Correlation Heat Map

Bivariate analysis shows some degree of correlation between the following pairs of variables:

- Education and Salary (positive correlation)
- No of young children and age (negative correlation)
- No of young children and no of older children (positive correlation)

There seems to be little or no correlation between the remaining variables
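
A correlation matrix of this kind can be obtained with pandas, as in the sketch below. The data is synthetic (Salary is deliberately constructed to rise with educ) and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data; Salary is built to increase with educ.
rng = np.random.default_rng(0)
educ = rng.integers(5, 18, size=200)
df = pd.DataFrame({
    "educ": educ,
    "Salary": 3000 * educ + rng.normal(0, 5000, size=200),
    "age": rng.integers(20, 62, size=200),
})

corr = df.corr()                   # Pearson correlation matrix
print(corr.round(2))
```

The resulting matrix can be passed to a plotting function such as `seaborn.heatmap(corr, annot=True)` to draw the heat map.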


2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).

Before splitting the data and fitting the models, the categorical variables “Holiday
Package” and “Foreign” need to be converted into numerical values, since models like Logistic
Regression and Linear Discriminant Analysis can only take numerical inputs.

Upon conversion of categorical variables into numerical variables, the dataframe looks as below:

- A value of 0 in the foreign_yes column indicates that the employee is local

- A value of 0 in Holiday_Package indicates a non-buyer and 1 indicates a buyer of the Holiday Package

The next step is to extract the target column into a separate vector for the training as well as the test
set. Post the extraction step, independent variables are stored in the ‘X’ dataframe and the Target
column is stored in the ‘y’ dataframe
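
One way to perform the encoding and the X / y separation is sketched below with `pandas.get_dummies`. The report does not state which encoding function was used, so this is an assumption, as are the input column names and the sample values.

```python
import pandas as pd

# Small illustrative frame; column names and values are assumptions.
df = pd.DataFrame({
    "Holiday_Package": ["no", "yes", "no", "yes"],
    "Salary": [48412, 37207, 58022, 66503],
    "foreign": ["no", "no", "yes", "yes"],
})

# One-hot encode the string columns; drop_first keeps a single 0/1 column each.
encoded = pd.get_dummies(
    df, columns=["Holiday_Package", "foreign"], drop_first=True, dtype=int
)
encoded = encoded.rename(columns={"Holiday_Package_yes": "Holiday_Package"})

X = encoded.drop(columns=["Holiday_Package"])   # independent variables
y = encoded["Holiday_Package"]                  # target: 1 = buyer, 0 = non-buyer
```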

Independent Variables Dataset Sample

Dependent Variables Dataset Sample

Both X (independent variables) and y (target variable) datasets will now be split into Test and Train
data (using the “train_test_split” function from the “sklearn” package in python) in a 70:30 proportion,
meaning the training dataset will have 70% data from the full dataset and the test data will have the
remaining 30% data. Post executing the above function, we get the following test and training sets
for X and y datasets

 X_train has 609 rows and 6 columns


 X_test has 262 rows and 6 columns
 y_train has 609 rows and 1 column
 y_test has 262 rows and 1 column

Distribution of target variable in classes 0 and 1

Training Data

Test Data

The distribution of target variable in classes 0 and 1 in the training and test data is consistent and it
is also in the same proportion as was in the full dataset before the split
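
The split can be sketched with `train_test_split` as below. The data is synthetic with the same number of rows (871) and features (6); `random_state` and `stratify` are assumptions added for reproducibility and balanced classes, and the report does not say whether they were used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 871 rows and 6 features.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(871, 6)), columns=[f"x{i}" for i in range(6)])
y = pd.Series(rng.integers(0, 2, size=871), name="Holiday_Package")

# 70:30 split; stratify=y keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True).round(2))
print(y_test.value_counts(normalize=True).round(2))
```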

Building the Logistic Regression model

Logistic Regression is a statistical approach for calculating the probability of each target
label. In its basic form it is used to classify binary data. Logistic regression is similar to
linear regression in that the explanatory variables (X) are combined with weights to predict
a target variable (y). The main difference between the two is the type of the target variable,
which in logistic regression is categorical (here, a binary class)
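
The idea of combining weighted explanatory variables and turning the result into a probability can be shown in a few lines. The weights, intercept and input values below are hypothetical and are not fitted to the project's data.

```python
import numpy as np

def predict_proba(x, weights, intercept):
    """Return P(y = 1 | x) under a logistic regression model."""
    z = np.dot(x, weights) + intercept   # linear combination, as in linear regression
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid maps z into the interval (0, 1)

# Hypothetical weights for two features.
weights = np.array([0.8, -0.5])
p = predict_proba(np.array([2.0, 1.0]), weights, intercept=0.2)
label = int(p >= 0.5)                    # class 1 if the probability is at least 0.5

print(round(float(p), 3), label)
```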

The model is built using the following parameters:

GridSearch function has been used to select the final tuning parameters from a range of values for
each parameter. Multiple values for each parameter were tested before arriving at the final best
parameter as shown below:
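
A grid search of this kind might be set up roughly as follows. The grid values, the synthetic data and the variable names are assumptions for illustration and are not the parameters selected in this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic training data; the grid below is hypothetical.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

param_grid = {
    "C": [0.01, 0.1, 1, 10],             # inverse regularization strength
    "solver": ["liblinear", "lbfgs"],
}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_        # refit on the full training data
print(grid.best_params_, round(grid.best_score_, 3))
```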
Building the Linear Discriminant Analysis (LDA) model

LDA and logistic regression are both multivariate statistical methods which are used to model the
relationship between several independent variables and a categorical dependent variable.

Logistic Regression has been explained above. In LDA, the orthogonal (perpendicular to each
other) discriminant functions are estimated such that they maximize the difference of means between
the existing groups (class labels) while minimizing the standard deviation within the groups. Thus,
the predicted class for a data point will be the one that has the highest value for its corresponding
linear function.

In Python, an LDA model can be fit using the LinearDiscriminantAnalysis class from the scikit-learn
package as below:

Importing the package and splitting the data into Test and Train sets in 70:30 ratio

Fitting the Training data into the model
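
A minimal sketch of fitting an LDA model with scikit-learn is given below. It uses two synthetic Gaussian classes in place of the project's training set and default settings, which are assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic Gaussian classes with shifted means.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, size=(100, 2))
X1 = rng.normal(loc=2.0, size=(100, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

pred = lda.predict(X_train)                # predicted class labels
proba = lda.predict_proba(X_train)[:, 1]   # estimated probability of class 1
print("training accuracy:", round(float((pred == y_train).mean()), 2))
```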

Q2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
Final Model: Compare Both the models and write inference which model is best/optimized.

Ans 2.3 Evaluating Model Performance

Once the models are fit, we will look at a few performance measures for both of them to see
how good or bad they are at making predictions. Key measures and terminologies used in
evaluating model performance include:
True Positives (TP): Predicted by model as Yes and are actually also Yes

True Negatives (TN): Predicted by model as No and are actually also No

False Positives (FP): Predicted by Model as Yes but are actually No

False Negative (FN): Predicted by Model as No but are actually Yes

The above four metrics are included in Confusion Matrix, from which the following ratios are
calculated to assess the accuracy of Predictions. The output of the below ratios can be seen in the
Classification report and the ROC plot. Key evaluation measures are:

Accuracy: Measures how often is the model correct. Calculated as:

(TP+ TN)/Total observations

Sensitivity or Recall: When it's actually yes, how often does it predict yes. Calculated as:

TP/(TP+FN) (Also known as True Positive Rate (TPR))

Precision: Among the points identified as Positive by the model, how many are really Positive

TP/(TP+FP)

Specificity: How many of the actual Negative data points are identified as negative by the model

TN/(TN+FP)

F-Score: Harmonic mean of the Recall and Precision

2* precision*recall /(precision + recall)

ROC Curve: This is a commonly used graph that summarizes the performance of a classifier over all
possible thresholds. It is generated by plotting the True Positive Rate (TP/Total Actual Positives) on
the y-axis against the False Positive Rate (FP/Total Actual Negatives) on the x-axis

AUC: AUC is an abbreviation for the area under the ROC curve. The closer the AUC for a model
comes to 1, the better it is. So models with higher AUCs are preferred over those with lower AUCs.
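
The count-based ratios defined above (accuracy, recall, precision, specificity and F-score) can be worked through with a small helper. The function name and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the ratios defined above from confusion-matrix counts."""
    total = tp + tn + fp + fn
    recall = tp / (tp + fn)                # sensitivity / true positive rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "specificity": tn / (tn + fp),     # TN / (TN + FP)
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for a test set of 200 observations.
m = classification_metrics(tp=60, tn=90, fp=30, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy, recall and specificity all come out at 0.75, while precision is 60/90 ≈ 0.667 and the F-score is about 0.706.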

Now we will evaluate the performance of each model using the above measures
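
In scikit-learn, these measures are available as ready-made functions. The sketch below applies them to a logistic regression fitted on synthetic data, so the printed numbers are unrelated to the results reported in the following sections.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

# Synthetic data and model, used only to demonstrate the metric functions.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)
model = LogisticRegression().fit(X, y)

pred = model.predict(X)                     # hard class labels
proba = model.predict_proba(X)[:, 1]        # probability of class 1

acc = accuracy_score(y, pred)
cm = confusion_matrix(y, pred)              # rows: actual, columns: predicted
auc = roc_auc_score(y, proba)               # needs probabilities, not labels
fpr, tpr, thresholds = roc_curve(y, proba)  # points of the ROC curve

print(classification_report(y, pred))
print("accuracy:", round(acc, 2), "AUC:", round(auc, 2))
```

The same calls can be repeated on the training and test sets of each model to obtain the figures compared later.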

Model 1: Logistic Regression

1.1 Accuracy Score

Training Data: 0.66


Test Data: 0.68
1.2 Confusion Matrix

Training Data:

Test Data:

1.3 Classification Reports:

Training Data:

Test Data:
1.4 AUC Score and AUC_ROC plots

Training Data:

Test Data:

Overall Summary of the Key Measures of Logistic Regression Model

Train Data:
AUC: 0.73
Accuracy: 0.66
F1 Score: 0.59
Precision: 0.65

Test Data:
AUC: 0.73
Accuracy: 0.68
F1 Score: 0.62
Precision: 0.67

Observations

 Training and Test set results are very similar
 The model’s accuracy is slightly above 65%
 The scores across the other metrics are also moderately good and fairly consistent across
the train and test data
Model 2: LDA

2.1 Accuracy Score

Training Data: 0.65


Test Data: 0.65

2.2 Confusion Matrix

Training Data:

Test Data:

2.3 Classification Reports:

Training Data:
Test Data:

2.4 AUC Score and AUC_ROC plots

Training Data:

Test Data:
Overall Summary of the Key Measures of LDA Model

Train Data:
AUC: 0.73
Accuracy: 0.65
F1 Score: 0.59
Precision: 0.64

Test Data:
AUC: 0.73
Accuracy: 0.66
F1 Score: 0.60
Precision: 0.66

Observations

 The model’s accuracy is around 65%
 The scores across the other metrics are moderately good and fairly consistent across the
test and training data

Comparison of the key performance measures across the two models

Logistic Regression LDA

Observation:

Upon comparing the key metrics across the two models, Logistic Regression seems marginally
better than the LDA model as its scores on Accuracy, F1 and Precision are slightly better than LDA.
AUC scores are same across the two models
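
The comparison can be laid out side by side with pandas. The figures below are the test-set values from the two “Overall Summary” blocks above; the variable name is an assumption.

```python
import pandas as pd

# Test-set figures as reported in the two "Overall Summary" blocks.
test_metrics = pd.DataFrame(
    {
        "Logistic Regression": [0.73, 0.68, 0.62, 0.67],
        "LDA": [0.73, 0.66, 0.60, 0.66],
    },
    index=["AUC", "Accuracy", "F1 Score", "Precision"],
)

# Positive differences favour Logistic Regression.
test_metrics["Difference"] = (
    test_metrics["Logistic Regression"] - test_metrics["LDA"]
).round(2)

print(test_metrics)
```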
Q2.4 Inference: Basis on these predictions, what are the business insights and
recommendations.

Ans 2.4

Based on the above analysis, results from the Logistic Regression model are marginally better than
those from LDA. Further, the model is fairly consistent in prediction, with about 65% accuracy. Hence,
the company can implement the Logistic Regression model to start with.

The model’s performance is currently not the best and there is definitely scope for further
improvement, but it is not very poor either. It can still be implemented to predict, with about
65% accuracy, which employees might purchase the Holiday package. Further, the company can
try to improve the model performance over time as more data becomes available.
