PM Project Logistic Regression LDA.docx
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some didn't. You have to help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also, find out the important
factors on the basis of which the company will focus on particular employees to sell their packages.
Attributes:
Q2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.
Ans 2.1:
The dataset relates to various attributes of 872 employees of a company along with
information on which employee bought or did not buy the Holiday package from the tour
company
The data has a column called “Unnamed: 0”, which appears to be a serial number and is unlikely to be relevant for the analysis. Hence, this column will be dropped from further analysis
Other than that, the data has 7 variables:
o Holiday Package and Foreign seem to be categorical variables
o Salary, age, education, no of young children, no of older children appear to be numerical variables.
Holiday Package is the target variable and rest all are the independent variables
We can confirm the datatypes (as mentioned above) of the variables in the dataset using the “info”
function (from the pandas package in Python). The function also shows whether there
are any null values in any of the features, as well as the total number of rows and columns in the data
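A minimal sketch of this ingestion step is shown below. The tiny synthetic frame and its column names (e.g. "Holliday_Package", "no_young_children") are assumptions standing in for the actual dataset, which would normally be loaded with pd.read_csv:

```python
import pandas as pd

# Synthetic frame mirroring the report's schema; in the real analysis
# this would be df = pd.read_csv("<dataset file>") instead.
df = pd.DataFrame({
    "Holliday_Package": ["yes", "no", "no"],   # target variable (assumed name)
    "Salary": [48412.0, 37207.0, 58022.0],
    "age": [30, 45, 46],
    "educ": [8, 8, 9],
    "no_young_children": [1, 0, 0],
    "no_older_children": [1, 1, 0],
    "foreign": ["no", "no", "no"],
})

df.info()                     # dtypes, non-null counts, row/column totals
print(df.describe())          # descriptive statistics for numerical columns
print(df.isnull().sum())      # null-value condition check, per column
```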
The mean of the “Salary” and “No of young children” variables is greater than the median, indicating
right skewness in the data. We will examine the magnitude of the skewness in the univariate analysis
section
For all the other numerical variables (age, education, no of older children), the mean and median
values are roughly equal, indicating that the data could be normally distributed. We will
confirm this through univariate analysis
The coefficient of variation of Salary, Age and Education is less than 1, indicating that the data is
concentrated around the mean with little skewness
The coefficient of variation of no. of young children and no. of older children is more than 1, indicating
some skewness in the data
There are 2 sub-classes in Holiday Package and Foreign features. More information on the
distribution/frequency of sub-classes within each variable will be covered in the univariate
analysis section.
Mean/Median of no. of older children is 1
The minimum value in the Salary feature is 1322, which seems very low. Based on the
table below, only one row has this value, and it appears to be an anomaly given the
age and number of education years of the employee. Hence, we will exclude this row
from the analysis:
Due to the exclusion of the above-mentioned row, the total number of rows is now 871
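The exclusion of the anomalous row can be sketched as below; the three-row frame is an illustrative stand-in for the actual data, in which only one row carries the Salary value 1322:

```python
import pandas as pd

# Synthetic stand-in: one row with the anomalous minimum Salary of 1322.
df = pd.DataFrame({"Salary": [1322.0, 48412.0, 37207.0],
                   "age": [20, 30, 45]})

# Drop the anomalous row and re-index the remaining rows.
df = df.loc[df["Salary"] != 1322].reset_index(drop=True)
print(len(df))  # one fewer row than before the exclusion
```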
400 employees opted for Holiday package, while 471 did not opt for it
216 employees are foreigners and 656 employees are locals
Other than that, the data has no missing values or anomalies
Univariate Analysis (Skewness Score, Histogram and Boxplot of continuous variables)
Salary
Age
Education
Salary data is right skewed and has quite a few outliers on the higher side. These
outliers look genuine, as this kind of dispersion in salary data is normal in a company: the
majority of employees fall within a certain salary range, while employees at higher
levels (Manager and above) are paid relatively much more. Hence, we will not be treating
these outliers
Age data looks normally distributed with no outliers
Education data also looks fairly normally distributed with very few outliers, and the outlier
values look plausible. Hence, we will not be treating the outliers in this case either
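The univariate checks above (skewness score, histogram, boxplot per continuous variable) can be sketched as follows. The toy values are assumptions chosen only to show the mechanics, not the report's actual figures:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

# Toy data: a right-skewed "Salary" and a roughly symmetric "age".
df = pd.DataFrame({"Salary": [30000, 32000, 35000, 40000, 150000],
                   "age": [25, 32, 40, 48, 55]})

for col in ["Salary", "age"]:
    print(col, "skewness:", round(df[col].skew(), 2))
    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(df[col]); ax1.set_title(f"{col} histogram")
    ax2.boxplot(df[col]); ax2.set_title(f"{col} boxplot")
    fig.savefig(f"{col}_univariate.png")
    plt.close(fig)
```

A positive skewness score corresponds to the right tail seen in the Salary histogram.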
The charts below show the count of observations under each sub-class of the categorical variables.
Observations:
1) Majority of the employees have 0 children, both in the “no of young children” and “no of
older children” features
2) In the “no of older children” feature, among employees that have children, most have either 1 or
2 children. Very few employees have 3 or more children
3) Majority of the employees are locals, while there are a few foreigners
Bivariate Analysis
Boxplots of Salary, Age and Education with split by “Holiday Package” buyers
Observations:
1) There does not seem to be significant difference in the median Salary, age and
education of employee groups who bought the package vs those who did not
2) In all the three cases, the median Salary, Age and Education of employees who did not
buy the package is marginally higher than that of employees who bought the package
Countplot of no. of young children, no. of older children and Foreign with split by “Holiday
Package” buyers
Observations:
1) Number of employees buying the package is slightly more for two groups:
2) For all the other sub-classes within the “no of young children” and “no of
older children” features, relatively more employees did NOT buy the package
3) Amongst local employees, package buyers are fewer, while amongst foreigners
the number of buyers is relatively higher
Pair plot
Before splitting the data and fitting the models, the categorical variables “Holliday
Package” and “Foreign” need to be converted into numerical values, since models like Logistic
Regression and Linear Discriminant Analysis can only take numerical inputs.
Upon conversion of categorical variables into numerical variables, the dataframe looks as below:
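The conversion can be sketched as below; the column names and the yes/no coding are assumptions mirroring the report's two categorical variables:

```python
import pandas as pd

# Two-column toy frame for the categorical variables (assumed names/values).
df = pd.DataFrame({"Holliday_Package": ["yes", "no", "yes"],
                   "foreign": ["no", "yes", "no"]})

# Map yes/no to 1/0 so the models receive numerical inputs.
for col in ["Holliday_Package", "foreign"]:
    df[col] = df[col].map({"no": 0, "yes": 1}).astype(int)

print(df.dtypes)  # both columns are now integer-typed
```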
The next step is to extract the target column into a separate vector for the training as well as the test
set. Post the extraction step, independent variables are stored in the ‘X’ dataframe and the Target
column is stored in the ‘y’ dataframe
Both X (independent variables) and y (target variable) datasets will now be split into Test and Train
data (using the “train_test_split” function from the “sklearn” package in python) in a 70:30 proportion,
meaning the training dataset will have 70% data from the full dataset and the test data will have the
remaining 30% data. Post executing the above function, we get the following test and training sets
for X and y datasets
Training Data
Test Data
The distribution of target variable in classes 0 and 1 in the training and test data is consistent and it
is also in the same proportion as was in the full dataset before the split
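The 70:30 split described above can be sketched as follows. The toy X and y, and the random_state value, are illustrative assumptions; stratifying on y is what keeps the class proportions consistent across the train and test sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the report's X (independent variables) and y (target).
X = pd.DataFrame({"Salary": range(10), "age": range(30, 40)})
y = pd.Series([0, 1] * 5, name="Holliday_Package")

# 70:30 split; stratify=y preserves the 0/1 proportion in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)
```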
Logistic Regression is a statistical approach for estimating the probability of each
target label. In its basic form it is used to classify binary data. Logistic regression is
similar to linear regression in that the explanatory variables (X) are combined with weights to predict
a binary-class target variable (y). The main difference between linear regression and logistic
regression is the type of the target variable, which in the case of the latter is categorical
GridSearch function has been used to select the final tuning parameters from a range of values for
each parameter. Multiple values for each parameter were tested before arriving at the final best
parameter as shown below:
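A sketch of this grid search is shown below. The synthetic data, the particular parameter grid, the cv fold count and the scoring metric are all assumptions for illustration; the report's actual grid is not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data standing in for the employee dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# Illustrative grid of candidate tuning parameters.
grid = {"C": [0.1, 1.0, 10.0], "solver": ["lbfgs", "liblinear"]}

search = GridSearchCV(LogisticRegression(max_iter=1000), grid,
                      cv=3, scoring="f1")
search.fit(X, y)

print(search.best_params_)          # the selected "best" parameters
best_model = search.best_estimator_ # refit on the full data with them
```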
Building the Linear Discriminant Analysis (LDA) model
LDA and logistic regression are both multivariate statistical methods used to determine
relationships between independent variables and a categorical dependent variable.
While Logistic Regression has been explained above, in LDA, orthogonal (mutually
perpendicular) discriminant functions are estimated such that the difference of means between
the existing groups (class labels) is maximized while the variation within the groups is minimized. Thus,
the predicted class for a data point is the one with the highest value of its corresponding
linear discriminant function.
In Python, an LDA model can be fit using the LinearDiscriminantAnalysis() function from the scikit-learn
package, as below:
Importing the package and splitting the data into Test and Train sets in a 70:30 ratio
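A minimal sketch of the LDA fit, using synthetic data in place of the employee dataset (the data and random_state are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, split 70:30 as in the report.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Fit LDA with default parameters and score it on the held-out set.
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print(lda.score(X_test, y_test))  # test-set accuracy
```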
Q2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
Final Model: Compare Both the models and write inference which model is best/optimized.
Once the models are fit, we will look at a few performance measures for both models to see
how good or bad they are at making predictions. Key measures and terminologies used in
evaluating model performance include:
True Positives (TP): Predicted by the model as Yes and actually also Yes
The above four metrics are included in Confusion Matrix, from which the following ratios are
calculated to assess the accuracy of Predictions. The output of the below ratios can be seen in the
Classification report and the ROC plot. Key evaluation measures are:
Sensitivity or Recall: When it's actually yes, how often does the model predict yes. Calculated as: TP/(TP+FN)
Precision: Among the points identified as Positive by the model, how many are really Positive
TP/(TP+FP)
Specificity: How many of the actual Negative data points are identified as negative by the model. Calculated as:
TN/(TN+FP)
ROC Curve: This is a commonly used graph that summarizes the performance of a classifier over all
possible thresholds. It is generated by plotting the True Positive Rate (TP/Total Actual Positives) on
the y-axis against the False Positive Rate (FP/Total Actual Negatives) on the x-axis
AUC: AUC is an abbreviation for the area under the ROC curve. The closer the AUC for a model
comes to 1, the better it is. So models with higher AUCs are preferred over those with lower AUCs.
Now we will evaluate the performance of each model using the above measures
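The evaluation loop over the train and test sets can be sketched as below. The synthetic data and the fitted Logistic Regression are illustrative assumptions; the same calls apply unchanged to the LDA model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a fitted model to evaluate.
X, y = make_classification(n_samples=300, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, Xs, ys in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = model.predict(Xs)                 # hard class labels
    prob = model.predict_proba(Xs)[:, 1]     # probabilities for the ROC/AUC
    print(name, "accuracy:", round(accuracy_score(ys, pred), 2))
    print(confusion_matrix(ys, pred))        # TP/FP/FN/TN counts
    print(classification_report(ys, pred))   # precision, recall, F1
    print(name, "AUC:", round(roc_auc_score(ys, prob), 2))
    fpr, tpr, _ = roc_curve(ys, prob)        # points for plotting the ROC curve
```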
Training Data:
Test Data:
Training Data:
Test Data:
AUC Score and ROC plots
Training Data:
Test Data:
Train Data:
AUC: 0.73
Accuracy: 0.66
F1 Score: 0.59
Precision: 0.65
Test Data:
AUC: 0.73
Accuracy: 0.68
F1 Score: 0.62
Precision: 0.67
Observations
Training Data:
Test Data:
Training Data:
Test Data:
Training Data:
Test Data:
Overall Summary of the Key Measures of LDA Model
Train Data:
AUC: 0.73
Accuracy: 0.65
F1 Score: 0.59
Precision: 0.64
Test Data:
AUC: 0.73
Accuracy: 0.66
F1 Score: 0.60
Precision: 0.66
Observation:
Upon comparing the key metrics across the two models, Logistic Regression appears marginally
better than the LDA model, as its Accuracy, F1 and Precision scores are slightly higher.
AUC scores are the same across the two models
Q2.4 Inference: Basis on these predictions, what are the business insights and
recommendations.
Ans 2.4
Based on the above analysis, the results from the Logistic Regression model are slightly better than
those from LDA. Further, the model is fairly consistent in its predictions, with about 65% accuracy. Hence,
the company can implement the Logistic Regression model to start with.
While the model's current performance is not the best and there is definitely scope for further
improvement, it is not very poor either: it can still be used to predict, with about 65% accuracy,
whether an employee will purchase the Holiday package. Further, the company can work on improving
the model's performance over time as more data becomes available.