PREDICTING EMPLOYEE ATTRITION
1.1 OBJECTIVE AND SCOPE OF THE STUDY
 The objective of this project is to predict, for each employee, the
likelihood of attrition, and so identify who is more likely to leave the
organization.
 It will help organizations find ways to prevent attrition or
plan the hiring of new candidates in advance.
 Attrition proves to be a costly and time-consuming problem
for the organization, and it also leads to loss of productivity.
 The scope of the project extends to companies in all
industries.
1.2 ANALYTICS APPROACH
 Check for missing values in the data and, if any are found, process
the data accordingly.
 Understand how the features are related to our target
variable, attrition
 Convert target variable into numeric form
 Apply feature selection and feature engineering to make it
model ready
 Apply various algorithms to check which one is the most
suitable
 Draw out recommendations based on our analysis.
1.3 DATA SOURCES
 For this project, an HR dataset named ‘IBM HR Analytics
Employee Attrition & Performance’ has been picked, which
is available on the IBM website.
 The data contains records of 1,470 employees.
 It has information about each employee’s current employment
status, the total number of companies worked for in the past,
the total number of years at the current company and in the current
role, education level, distance from home, monthly
income, etc.
1.4 TOOLS AND TECHNIQUES
 We have selected Python as our analytics tool.
 Python includes many packages such as Pandas, NumPy,
Matplotlib, Seaborn etc.
 Algorithms such as Logistic Regression, Random Forest,
Support Vector Machine and XGBoost have been used for
prediction.
2.1 IMPORTING LIBRARY AND DATA EXTRACTION
 Importing Packages
 Data Extraction
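The import and extraction step could look roughly like the sketch below. The slides do not show the file name or the loading call, so the helper `load_attrition_data`, the filename in the comment, and the in-memory sample rows (invented values) are assumptions used only to keep the example runnable.

```python
import io

import pandas as pd

# Small stand-in for the HR CSV file; the values are made up for illustration.
SAMPLE_CSV = """Age,Attrition,Department,DistanceFromHome,MonthlyIncome,OverTime
40,Yes,Sales,2,6000,Yes
48,No,Research & Development,9,5100,No
36,Yes,Research & Development,3,2100,Yes
34,No,Research & Development,4,2900,Yes
28,No,Human Resources,1,3500,No
"""


def load_attrition_data(source):
    """Read the HR data from a CSV path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# With the real download this would be something like
# load_attrition_data("hr_attrition.csv")  (hypothetical filename).
attrition_df = load_attrition_data(io.StringIO(SAMPLE_CSV))
print(attrition_df.shape)
```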
2.2 EXPLORATORY DATA ANALYSIS
 Refers to the process of performing initial investigations on the
data so as to discover patterns, spot inconsistencies, test
hypotheses and check assumptions with the help of graphical
representations
 Displaying First 5 Rows
 Displaying rows and columns
 Identifying Missing Values
 Count of “Yes” and “No” values of Attrition
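The four checks listed above map onto standard pandas calls. The sketch below runs them on a small made-up DataFrame standing in for the real data; the column values are assumptions, and only the calls themselves are the point.

```python
import pandas as pd

# Made-up stand-in rows; the real dataset has 1,470 employees.
attrition_df = pd.DataFrame({
    "Age": [40, 48, 36, 34, 28, 31],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "No"],
    "MonthlyIncome": [6000, 5100, 2100, 2900, 3500, 3100],
})

first_rows = attrition_df.head()                     # first 5 rows
n_rows, n_cols = attrition_df.shape                  # rows and columns
missing_per_column = attrition_df.isnull().sum()     # missing values per column
attrition_counts = attrition_df["Attrition"].value_counts()  # "Yes" / "No" counts

print(first_rows)
print(n_rows, n_cols)
print(missing_per_column)
print(attrition_counts)
```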
2.3 VISUALIZATION (EDA)
 Attrition vs. “Age”
 Attrition vs. “Distance from Home”
 Attrition vs. “Job Satisfaction”
 Attrition vs. “Performance Rating”
 Attrition vs. “Training Times Last Year”
 Attrition vs. “Work Life Balance”
 Attrition vs. “Years At Company”
 Attrition vs. “Years in Current Role”
 Attrition vs. “Years Since Last Promotion”
 Attrition vs. Categorical Variables
 Attrition vs. “Gender, Marital status and Overtime”
 Attrition vs. “Department, Job Role, and Business Travel”
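The plots themselves are not reproduced in the text. As a rough numerical counterpart, the share of leavers within each category can be computed with a group-by, as sketched below on invented rows; the column name `OverTime` is taken to match the dataset's naming style and is an assumption here.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "OverTime":  ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
    "Attrition": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# Boolean flag: True where the employee left.
left = df["Attrition"].eq("Yes")

# Mean of a boolean series per group = attrition rate per category.
attrition_rate_by_overtime = left.groupby(df["OverTime"]).mean()
print(attrition_rate_by_overtime)
```

The resulting series can be passed to a bar chart (for example with its `.plot(kind="bar")` method, which needs Matplotlib) to get a picture comparable to the categorical plots.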
Data Pre-Processing
 Refers to the data mining technique that transforms raw data into
an understandable format
 Useful in making the data ready for analysis
Steps Involved:
 Taking care of missing data and dropping non-relevant
features
 Feature extraction
 Converting categorical features into numeric form
 Binarization of the converted categorical features
 Feature scaling
 Understanding correlation of features with each other
 Splitting data into training and test data sets
3.1 FEATURE SELECTION
 Process of selecting those features that contribute
most to the prediction variable or output.
Benefits of feature selection:
 Improves model performance
 Improves accuracy
 Provides a better understanding of the data
Dropping non-relevant variables
# Dropping all fixed and non-relevant variables
cols_to_drop = ['DailyRate', 'EmployeeCount', 'EmployeeNumber', 'HourlyRate',
                'MonthlyRate', 'Over18', 'PerformanceRating', 'StandardHours',
                'StockOptionLevel', 'TrainingTimesLastYear']
attrition_df.drop(cols_to_drop, axis=1, inplace=True)
Check number of rows and columns
Features Extraction
3.2 FEATURE ENGINEERING
Label Encoding
 Label Encoding refers to converting categorical variables into numeric
form so that they are machine-readable.
 It is an important pre-processing step for structured datasets in supervised
learning.
 Fit and transform the required columns of the data, and then replace the
existing text data with the new encoded data.
Convert categorical variables into numeric variables
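A minimal sketch of the fit, transform and replace steps, assuming scikit-learn's `LabelEncoder` is the encoder in use (the slides do not show the call). The rows and the choice of columns are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows; column names follow the dataset's naming style.
df = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently",
                       "Non-Travel", "Travel_Rarely"],
})

encoders = {}
for col in ["Attrition", "BusinessTravel"]:
    enc = LabelEncoder()
    # fit_transform learns the sorted category list and returns integer codes,
    # which replace the original text column.
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc  # kept so codes can be mapped back later

print(df)
print(list(encoders["BusinessTravel"].classes_))
```

Keeping the fitted encoders makes it possible to recover the original labels with `inverse_transform`.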
 One Hot Encoder
 It is used to perform “binarization” of the categorical features and
include the resulting columns as features to train the model.
 It takes a column of categorical data that has been label
encoded and splits it into multiple columns, one per category.
 The numbers are replaced by 1s and 0s, depending on which
column has what value.
Applying “One Hot Encoder” on Label Encoded features
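The column-splitting step can be sketched with `pandas.get_dummies`, used here as a stand-in for whichever one-hot encoder the slides applied. The label-encoded values are illustrative.

```python
import pandas as pd

# BusinessTravel is assumed to be label encoded already (0, 1, 2).
df = pd.DataFrame({
    "Age": [40, 48, 36, 34],
    "BusinessTravel": [2, 1, 0, 2],
})

# One new 0/1 column is created per distinct code; the original column is dropped.
encoded = pd.get_dummies(df, columns=["BusinessTravel"], dtype=int)
print(encoded)
```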
Feature Scaling
 Feature scaling is a method used to standardize the range of
independent variables or features of data
 It is also known as Data Normalization
 Standardization rescales each feature to be centred around
zero with unit variance, so that all features are on a comparable scale
 The two most popular methods of feature scaling are standardization
and normalization
Scaling the features
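The two scaling methods named above can be compared side by side. This sketch assumes scikit-learn's `StandardScaler` (standardization) and `MinMaxScaler` (normalization to [0, 1]); the slides do not say which one was applied, and the numbers are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (e.g. age and monthly income).
X = np.array([
    [40.0, 6000.0],
    [48.0, 5100.0],
    [36.0, 2100.0],
    [34.0, 2900.0],
])

# Standardization: each column ends up with mean 0 and standard deviation 1.
X_standardized = StandardScaler().fit_transform(X)

# Normalization: each column is rescaled to the range [0, 1].
X_normalized = MinMaxScaler().fit_transform(X)

print(X_standardized.mean(axis=0), X_standardized.std(axis=0))
print(X_normalized.min(axis=0), X_normalized.max(axis=0))
```

In a real pipeline the scaler would be fitted on the training split only and then applied to the test split, to avoid leaking test statistics into training.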
Correlation Matrix
• Correlation is a statistical technique which determines how one
variable moves/changes in relation to another variable.
• It’s a bivariate analysis measure which describes the association
between different variables.
Usefulness of Correlation matrix –
 If two variables are closely correlated, then we can predict one
variable from the other.
 Correlation plays a vital role in locating the important variables
on which other variables depend.
 It is used as the foundation for various modeling techniques.
 Proper correlation analysis leads to better understanding of data.
Plotting correlation matrix
Correlation matrix Plot
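The matrix behind such a plot can be computed with `DataFrame.corr()`. The sketch below uses invented numeric rows and then ranks the features by their correlation with an already-encoded `Attrition` column.

```python
import pandas as pd

# Invented rows; Attrition is assumed to be encoded as 0/1 already.
df = pd.DataFrame({
    "YearsAtCompany":     [1, 2, 3, 4, 5, 6],
    "YearsInCurrentRole": [1, 1, 2, 3, 4, 5],
    "Attrition":          [1, 1, 0, 1, 0, 0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

# Correlation of every feature with the target, sorted from most negative
# to most positive.
target_corr = corr["Attrition"].drop("Attrition").sort_values()

print(corr.round(2))
print(target_corr.round(2))
```

A heatmap of `corr` (for instance with Seaborn's `heatmap`) gives the usual visual form of the matrix.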
Splitting data into train and test
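A possible form of the split, assuming scikit-learn's `train_test_split`. The 80/20 ratio, the stratification on the target and the random seed are assumptions, since the slides do not state them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, 5 positives (attrition = 1).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.sum(), y_test.sum())
```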
 The process of modeling means training a machine learning
algorithm to predict the labels from the features, tuning it for
the business need, and validating it on holdout data.
 Models used for employee attrition:
 Logistic Regression
 Random Forest
 Support Vector Machine
 XGBoost
Model building -
4.1 LOGISTIC REGRESSION
 Logistic Regression is one of the most basic and widely used
machine learning algorithms for solving a classification problem.
 It is a method used to predict a dependent variable (Y) from an
independent variable (X) when the dependent variable
is categorical.
 Linear Regression equation
Y = β0 + β1X + ε
 Y stands for the dependent variable that needs to be predicted.
 β0 is the Y-intercept, which is basically the point on the line which
touches the y-axis.
 β1 is the slope of the line (the slope can be negative or positive
depending on the relationship between the dependent variable and
the independent variable).
 X here represents the independent variable that is used to predict
our resultant dependent value.
 ε denotes the error in the computation
 Sigmoid Function
 Logistic regression passes the linear combination β0 + β1x through
the sigmoid function so that the output p(x) lies between 0 and 1.
p(x) = 1 / (1 + e^-(β0 + β1x))
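To make the sigmoid step concrete: the linear term β0 + β1x is squashed into a probability by p(x) = 1 / (1 + e^-(β0 + β1x)). The sketch below implements that mapping in plain Python; the coefficient values are arbitrary and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(x, beta0, beta1):
    """Probability of the positive class for one feature value x."""
    return sigmoid(beta0 + beta1 * x)


# Arbitrary illustrative coefficients (not fitted to any data).
beta0, beta1 = -2.0, 0.5
for x in (0, 4, 10):
    print(x, round(predict_probability(x, beta0, beta1), 3))
```

With these coefficients the linear term is zero at x = 4, so the predicted probability there is exactly 0.5, the usual decision threshold.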
 Building Logistic Regression Model
 Testing the Model
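Building and testing might look like the sketch below, assuming scikit-learn's `LogisticRegression`. Synthetic data from `make_classification` stands in for the prepared HR features, so the printed accuracy is not comparable to the figures reported in these slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-processed HR features and 0/1 target.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Build: fit the model on the training split.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Test: predict on the held-out split and measure accuracy.
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")
```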
 Confusion Matrix
 The confusion matrix is one of the most commonly used tools to
evaluate classification models.
 It tabulates the actual values against the predicted values, showing
where the model confuses one class with the other.
Standard table of confusion matrix (Positive class = 1, Negative class = 0):
              Predicted 0            Predicted 1
Actual 0      True Negative (TN)     False Positive (FP)
Actual 1      False Negative (FN)    True Positive (TP)
 Creating confusion matrix
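The four cells of the confusion matrix can be counted directly from the actual and predicted labels. The sketch below does this in plain Python on a small invented example, with 1 as the positive class and 0 as the negative class.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (positive class = 1)."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1
        elif actual == 0 and predicted == 0:
            counts["TN"] += 1
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


# Invented labels for illustration.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

cm = confusion_counts(y_true, y_pred)
accuracy = (cm["TP"] + cm["TN"]) / len(y_true)
print(cm, accuracy)
```

scikit-learn's `confusion_matrix(y_true, y_pred)` returns the same counts as a 2x2 array laid out as [[TN, FP], [FN, TP]] for labels 0 and 1.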
 AUC score
 Receiver Operating Characteristic (ROC)
 The ROC curve plots the true positive rate against the false positive
rate as the classification threshold is varied.
 The model's overall ability to separate the classes is summarized by
the Area Under the Curve (AUC).
 The area under the curve (AUC), also referred to as index of
accuracy (A) or concordance index, summarizes the ROC curve
in a single number. The higher the area, the better the model.
 Plotting ROC curve
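The points of a ROC curve and its AUC can be obtained from predicted scores with scikit-learn's `roc_curve`, `auc` and `roc_auc_score`. The labels and scores below are a tiny invented example, not model output from this study.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Invented true labels and predicted scores (higher = more likely positive).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per threshold; together they trace the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve, computed two equivalent ways.
auc_from_curve = auc(fpr, tpr)
auc_direct = roc_auc_score(y_true, y_score)
print(fpr, tpr)
print(auc_from_curve, auc_direct)
```

Plotting `tpr` against `fpr` (for example with Matplotlib's `plot`) draws the curve; the diagonal from (0, 0) to (1, 1) corresponds to random guessing with an AUC of 0.5.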
 ROC Curve For Logistic Regression
Using the Logistic Regression algorithm, we obtained an accuracy score of
79% and a roc_auc score of 0.77.
4.2 RANDOM FOREST
• Random Forest is a supervised learning algorithm.
• It builds a “forest” of classification trees, each trained on a random
bootstrap sample of the data (bagging), and aggregates their predictions.
• In Random Forest, only a random subset of the features is taken
into consideration by the algorithm for splitting a node.
 Building Random Forest Model
 Testing the Model
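A sketch of the build and test steps, assuming scikit-learn's `RandomForestClassifier`. The hyperparameters are assumptions and the data is synthetic, so the score is not the one reported in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared HR data.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 100 bagged trees; max_features="sqrt" means each split only considers
# a random subset of the features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_train, y_train)

rf_pred = forest.predict(X_test)
rf_proba = forest.predict_proba(X_test)[:, 1]  # probability of class 1
rf_accuracy = accuracy_score(y_test, rf_pred)
rf_auc = roc_auc_score(y_test, rf_proba)
print(f"accuracy: {rf_accuracy:.2f}, roc_auc: {rf_auc:.2f}")
```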
 Confusion Matrix
 AUC score
 Plotting ROC curve
Using the Random Forest algorithm, we obtained an accuracy score of 79%
and a roc_auc score of 0.76.
 ROC Curve For Random Forest
4.3 SUPPORT VECTOR MACHINE
 SVM is a supervised machine learning algorithm used for both
regression and classification problems.
 The objective is to find a hyperplane in an N-dimensional space
(N being the number of features) that separates the classes.
 Hyperplanes
 Hyperplanes are decision boundaries
that help segregate the data points.
 The dimension of the hyperplane
depends upon the number of features.
 Support Vectors
 These are data points that are closest to the hyperplane and
influence the position and orientation of the hyperplane.
 Used to maximize the margin of the classifier.
 Considered as critical elements of a dataset
 Kernel Technique
 Used when non-linear decision boundaries are needed
 The data are mapped into a higher-dimensional space, where the
separating hyperplane is no longer a line but a plane
 Since we have a non-linear classification problem, the kernel
technique used here is the Radial Basis Function (rbf)
 Helps in segregating data that are not linearly separable.
 Building SVM Model
 Testing SVM Model
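An RBF-kernel SVM could be built and tested roughly as follows, assuming scikit-learn's `SVC`. The two-moons toy data, the scaling step and the `C`/`gamma` settings are assumptions chosen to show a problem that is not linearly separable.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data with a curved class boundary (not linearly separable).
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale the features, then fit an SVM with the RBF kernel.
svm_model = make_pipeline(StandardScaler(),
                          SVC(kernel="rbf", C=1.0, gamma="scale"))
svm_model.fit(X_train, y_train)

svm_accuracy = svm_model.score(X_test, y_test)
# Signed distance to the decision boundary; usable as a score for ROC/AUC.
svm_scores = svm_model.decision_function(X_test)
# Number of training points that act as support vectors.
n_support_vectors = int(svm_model.named_steps["svc"].n_support_.sum())
print(svm_accuracy, n_support_vectors)
```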
 Confusion Matrix
 AUC Score
 Plotting ROC Curve
Using the SVM algorithm, we obtained an accuracy score of 79% and a
roc_auc score of 0.77.
 ROC Curve For SVM
4.4 XG BOOST
 XGBoost is a decision-tree-based ensemble Machine Learning algorithm
that uses a gradient boosting framework.
 XGBoost belongs to a family of boosting algorithms that convert weak
learners into strong learners.
 It is a sequential process: trees are grown one after the other, each using
information from the previously grown trees, so that the errors of the
previous model are corrected by the next predictor.
 Advantages of XGBoost -
 Regularization
 Parallel Processing
 High Flexibility
 Handling Missing Values
 Tree Pruning
 Built-in Cross-Validation
 Building XGBoost Model
 Testing the Model
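The sequential error-correcting idea can be sketched with scikit-learn's `GradientBoostingClassifier`, used here as a stand-in so the example needs no extra package. It is not XGBoost itself, although the `xgboost` package's `XGBClassifier` exposes a similar fit/predict interface. Data and hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared HR data.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 100 shallow trees added one after another; each new tree is fitted to the
# errors left by the trees before it.
booster = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     max_depth=3, random_state=0)
booster.fit(X_train, y_train)

gb_pred = booster.predict(X_test)
gb_proba = booster.predict_proba(X_test)[:, 1]
gb_accuracy = accuracy_score(y_test, gb_pred)
gb_auc = roc_auc_score(y_test, gb_proba)

# Training loss after the first and after the last boosting stage.
first_loss, last_loss = booster.train_score_[0], booster.train_score_[-1]
print(gb_accuracy, gb_auc, first_loss, last_loss)
```

The drop in training loss from the first to the last stage reflects each added tree correcting part of the remaining error.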
 Confusion Matrix
 AUC Score
 Plotting ROC Curve
Using the XGBoost algorithm, we obtained an accuracy score of 82% and a
roc_auc score of 0.81.
 ROC Curve For XGBoost Model
4.5 COMPARISON OF MODELS
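Model                     Accuracy   ROC AUC
Logistic Regression       79%        0.77
Random Forest             79%        0.76
Support Vector Machine    79%        0.77
XGBoost                   82%        0.81
 The table, compiled from the scores reported in sections 4.1 to 4.4, shows that
XGBoost outperforms all other models.
 Hence, based on these results, we can conclude that XGBoost will be the best
model to predict future Employee Attrition for this company.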
KEY FINDINGS
 The dataset does not feature any missing values or any redundant
features.
 The strongest positive correlations with the target feature are:
distance from home, job satisfaction, marital status, overtime and
business travel
 The strongest negative correlations with the target feature are:
performance rating and training times last year
RECOMMENDATIONS
 Transportation should be provided to employees living in the same
area, or else transportation allowance should be provided.
 Plan and allocate projects in such a way to avoid the use of
overtime.
 Employees who hit their two-year anniversary should be identified
as potentially having a higher risk of leaving.
 Gather information on industry benchmarks to determine if the
company is providing competitive wages.
THANK YOU
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Ad

Predicting Employee Attrition

  • 3. 1.1 OBJECTIVE AND SCOPE OF THE STUDY
      • The objective of this project is to predict the likelihood of attrition for each employee, in order to identify who is more likely to leave the organization.
      • This helps organizations find ways to prevent attrition or to plan the hiring of new candidates in advance.
      • Attrition is a costly and time-consuming problem for an organization, and it also leads to a loss of productivity.
      • The scope of the project extends to companies in all industries.
  • 4. 1.2 ANALYTICS APPROACH
      • Check for missing values in the data and, if any are found, handle them accordingly.
      • Understand how the features relate to the target variable, attrition.
      • Convert the target variable into numeric form.
      • Apply feature selection and feature engineering to make the data model-ready.
      • Apply several algorithms to check which one is the most suitable.
      • Draw recommendations from the analysis.
  • 5. 1.3 DATA SOURCES
      • For this project, an HR dataset named ‘IBM HR Analytics Employee Attrition & Performance’, available on the IBM website, was chosen.
      • The data contains records of 1,470 employees.
      • It includes each employee’s current employment status, the total number of companies worked for in the past, the total number of years at the current company and in the current role, education level, distance from home, monthly income, etc.
  • 6. 1.4 TOOLS AND TECHNIQUES
      • Python was selected as the analytics tool.
      • Python offers many packages for this work, such as Pandas, NumPy, Matplotlib, and Seaborn.
      • Logistic Regression, Random Forest, Support Vector Machine, and XGBoost were used for prediction.
  • 8. 2.1 IMPORTING LIBRARY AND DATA EXTRACTION
      • Importing Libraries
  • 9. Importing Packages
      • Data Extraction
  • 10. 2.2 EXPLORATORY DATA ANALYSIS
      • Refers to the process of performing initial investigations on the data to discover patterns, spot inconsistencies, test hypotheses, and check assumptions with the help of graphical representations.
      • Displaying First 5 Rows
  • 11. Displaying rows and columns
  • 13. Count of “Yes” and “No” values of Attrition
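The count of “Yes” and “No” values, and the conversion of the target into numeric form, can be sketched as follows. This is a minimal illustration using only the Python standard library; the six sample labels are made up for the example and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical sample of the Attrition column (the real dataset has 1,470 rows).
attrition = ["No", "Yes", "No", "No", "Yes", "No"]

# Count how often each label occurs.
counts = Counter(attrition)

# Share of employees who left, as a fraction of all records.
attrition_rate = counts["Yes"] / len(attrition)

# Convert the text labels into the numeric form a model expects (Yes -> 1, No -> 0).
target = [1 if label == "Yes" else 0 for label in attrition]

print(counts)
print(round(attrition_rate, 3))
print(target)
```

A strongly unequal count between the two classes is worth noting at this stage, because accuracy alone can look good on imbalanced data.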
  • 14. 2.3 VISUALIZATION (EDA)
      • Attrition vs. “Age”
  • 15. Attrition vs. “Distance from Home”
  • 16. Attrition vs. “Job Satisfaction”
  • 17. Attrition vs. “Performance Rating”
  • 18. Attrition vs. “Training Times Last Year”
  • 19. Attrition vs. “Work Life Balance”
  • 20. Attrition vs. “Years At Company”
  • 21. Attrition vs. “Years in Current Role”
  • 22. Attrition vs. “Years Since Last Promotion”
  • 23. Attrition vs. Categorical Variables
  • 24. Attrition vs. “Gender, Marital Status and Overtime”
  • 25. Attrition vs. “Department, Job Role, and Business Travel”
  • 27. Data Pre-Processing
      • Refers to the data mining technique that transforms raw data into an understandable format.
      • Useful in making the data ready for analysis.
      • Steps involved:
          • Taking care of missing data and dropping non-relevant features
          • Feature extraction
          • Converting categorical features into numeric form
          • Binarization of the converted categorical features
          • Feature scaling
          • Understanding the correlation of features with each other
          • Splitting data into training and test data sets
  • 28. 3.1 FEATURE SELECTION
      • The process of selecting those features that contribute most to the prediction variable or output.
      • Benefits of feature selection:
          • Improves performance
          • Improves accuracy
          • Provides a better understanding of the data
  • 29. Dropping non-relevant variables
      # Drop all constant and non-relevant variables
      attrition_df.drop(['DailyRate', 'EmployeeCount', 'EmployeeNumber', 'HourlyRate', 'MonthlyRate', 'Over18', 'PerformanceRating', 'StandardHours', 'StockOptionLevel', 'TrainingTimesLastYear'], axis=1, inplace=True)
      • Check number of rows and columns
  • 31. Label Encoding
      • Label Encoding refers to converting categorical variables into numeric form, so that they become machine-readable.
      • It is an important pre-processing step for structured datasets in supervised learning.
      • Fit and transform the required columns of the data, and then replace the existing text data with the new encoded data.
  • 32. Convert categorical variables into numeric variables
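The fit-and-transform idea behind label encoding can be illustrated with a small from-scratch sketch. The helper names `fit_label_encoder` and `transform` are hypothetical, and the sample values are assumed for illustration; this is not the code shown on the slide.

```python
def fit_label_encoder(values):
    """Learn a mapping from each distinct category to an integer code.

    Categories are numbered in sorted order, so the result is deterministic.
    """
    return {category: code for code, category in enumerate(sorted(set(values)))}


def transform(values, mapping):
    """Replace each text category with its learned integer code."""
    return [mapping[value] for value in values]


# Assumed sample values for a categorical column such as business travel.
business_travel = ["Travel_Rarely", "Travel_Frequently", "Non-Travel", "Travel_Rarely"]

mapping = fit_label_encoder(business_travel)
encoded = transform(business_travel, mapping)

print(mapping)
print(encoded)
```

The integer codes carry no real ordering for nominal categories, which is why a binarization step usually follows.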
  • 33. One Hot Encoder
      • It is used to perform “binarization” of the categorical features and include the result as features to train the model.
      • It takes a column of categorical data that has been label encoded and splits it into multiple columns.
      • The numbers are replaced by 1s and 0s, depending on which column holds which value.
  • 34. Applying “One Hot Encoder” on Label Encoded features
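As a rough illustration of how a label-encoded column is split into 0/1 columns, consider the sketch below. The function `one_hot` is a hypothetical helper written for this example, and the input codes are assumed.

```python
def one_hot(codes, n_categories):
    """Turn integer category codes into rows of 0s and 1s.

    Each row has exactly one 1, placed in the column of that row's category.
    """
    rows = []
    for code in codes:
        row = [0] * n_categories
        row[code] = 1
        rows.append(row)
    return rows


# Assumed label-encoded column with three categories (codes 0, 1, 2).
encoded = [2, 1, 0, 2]
binary_columns = one_hot(encoded, n_categories=3)

for row in binary_columns:
    print(row)
```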
  • 35. Feature Scaling
      • Feature scaling is a method used to standardize the range of the independent variables or features of the data.
      • It is also known as data normalization.
      • It rescales the features so that they are centred around zero and their variances fall in the same range.
      • The two most popular methods of feature scaling are standardization and normalization.
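The two scaling methods named above can be compared on a toy column. This is a sketch under assumptions: the income values are invented, and the slides do not say which method the authors applied.

```python
from statistics import mean, pstdev


def standardize(values):
    """Standardization: subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Normalization: rescale linearly so the values span the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical monthly income values.
monthly_income = [2000, 4000, 6000, 8000]

z_scores = standardize(monthly_income)
scaled = min_max(monthly_income)

print([round(z, 3) for z in z_scores])
print([round(s, 3) for s in scaled])
```

Standardization yields a column with mean 0 and standard deviation 1, while min-max normalization keeps all values between 0 and 1.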
  • 37. Correlation Matrix
      • Correlation is a statistical technique that determines how one variable moves or changes in relation to another variable.
      • It is a bivariate analysis measure that describes the association between different variables.
      • Usefulness of the correlation matrix:
          • If two variables are closely correlated, one variable can be predicted from the other.
          • Correlation plays a vital role in locating the important variables on which other variables depend.
          • It is used as the foundation for various modeling techniques.
          • Proper correlation analysis leads to a better understanding of the data.
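A correlation matrix is the Pearson correlation computed for every pair of columns. The sketch below shows the calculation from scratch; the column names and values are assumed for illustration and are not results from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_matrix(columns):
    """Compute the correlation of every column with every other column."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy data: longer tenure goes together with less attrition here.
data = {
    "YearsAtCompany": [1, 2, 3, 4, 5],
    "Attrition": [1, 1, 0, 0, 0],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Values close to +1 or -1 indicate a strong linear relationship, and the diagonal is always 1 because each column is perfectly correlated with itself.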
  • 40. Splitting data into train and test
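A train/test split shuffles the rows and holds a fraction back for evaluation. The sketch below is a minimal standard-library version; the 25% test share and the seed are assumptions, since the slides do not state the ratio used.

```python
import random


def train_test_split(rows, labels, test_size=0.25, seed=42):
    """Shuffle row indices reproducibly and hold out a share for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical feature rows; each label is chosen to equal the row's first value modulo 2.
rows = [[i, i * 10] for i in range(8)]
labels = [i % 2 for i in range(8)]

x_train, x_test, y_train, y_test = train_test_split(rows, labels)
print(len(x_train), len(x_test))
```

Keeping the test rows untouched until the end gives an honest estimate of how the model behaves on unseen employees.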
  • 42. Model building
      • The process of modeling means training a machine learning algorithm to predict the labels from the features, tuning it for the business need, and validating it on holdout data.
      • Models used for employee attrition:
          • Logistic Regression
          • Random Forest
          • Support Vector Machine
          • XGBoost
  • 43. 4.1 LOGISTIC REGRESSION
      • Logistic Regression is one of the most basic and widely used machine learning algorithms for solving a classification problem.
      • It is a method used to predict a dependent variable (Y) from an independent variable (X) when the dependent variable is categorical.
  • 44. Linear Regression equation: Y = β0 + β1X + ∈
      • Y stands for the dependent variable that needs to be predicted.
      • β0 is the Y-intercept, the point where the line touches the y-axis.
      • β1 is the slope of the line (the slope can be negative or positive depending on the relationship between the dependent variable and the independent variable).
      • X represents the independent variable that is used to predict the dependent value.
      • ∈ denotes the error in the computation.
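For reference, the standard way logistic regression builds on the linear form, using the slide's symbols Y, β0, β1 and X, is sketched below. The slide's own formula image is not preserved in this transcript, so this is the textbook relation rather than a reproduction of the slide.

```latex
% Linear predictor, using the symbols defined on the slide
z = \beta_0 + \beta_1 X

% Logistic (sigmoid) function maps the linear predictor to a probability
P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}

% Equivalent statement: the log-odds are linear in X
\log \frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = \beta_0 + \beta_1 X
```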
  • 46. Building Logistic Regression Model
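To show what fitting a logistic regression involves, here is a from-scratch sketch that learns one weight and one intercept by gradient descent. It is an illustration only: the single standardized feature, the learning rate, and the number of epochs are assumptions, and the authors most likely used a library implementation instead.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=1000):
    """Fit weight w and intercept b by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(x, w, b, threshold=0.5):
    """Predict class 1 when the estimated probability reaches the threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Hypothetical standardized feature: higher values go with attrition (label 1).
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
print(round(w, 3), round(b, 3))
print(predict(-1.0, w, b), predict(1.0, w, b))
```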
  • 48. Confusion Matrix
      • The confusion matrix is one of the most commonly used tools for evaluating classification models.
      • It avoids "confusion" by laying out the actual and predicted values in a tabular format.
      • Standard confusion matrix table, in which positive class = 1 and negative class = 0.
  • 49. Creating confusion matrix
      • AUC score
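The four cells of a confusion matrix, and the accuracy derived from them, can be computed as in this sketch. The labels and predictions are invented for illustration, with positive class = 1 and negative class = 0 as on the slide.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for a binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical actual labels and model predictions.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)

# Accuracy is the share of all predictions that were correct.
accuracy = (cm["TP"] + cm["TN"]) / len(y_true)

print(cm)
print(accuracy)
```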
  • 50. Receiver Operating Characteristic (ROC)
      • The ROC curve shows how a classification model performs as the user-defined threshold value is varied, by plotting the true positive rate against the false positive rate.
      • The model's performance is summarized using the Area Under the Curve (AUC).
      • The AUC, also referred to as the index of accuracy (A) or concordance index, represents the performance of the ROC curve. The higher the area, the better the model.
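The relationship between thresholds, the ROC curve and the AUC can be made concrete with a small from-scratch sketch. The labels and scores are a made-up four-row example, and the helper names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each distinct score is one threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical true labels and predicted probabilities of attrition.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc(points)

print(points)
print(area)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of leavers above stayers.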
  • 52. ROC Curve For Logistic Regression
      • Using the Logistic Regression algorithm, we obtained an accuracy score of 79% and a roc_auc score of 0.77.
  • 53. 4.2 RANDOM FOREST
      • Random Forest is a supervised learning algorithm.
      • It builds a forest of classification trees, made random through the bagging technique, and aggregates their predictions.
      • In Random Forest, only a random subset of the features is considered by the algorithm when splitting a node.
  • 54. Building Random Forest Model
  • 55. Testing the Model
      • Confusion Matrix
  • 56. AUC score
      • Plotting ROC curve
  • 57. ROC Curve For Random Forest
      • Using the Random Forest algorithm, we obtained an accuracy score of 79% and a roc_auc score of 0.76.
  • 58. 4.3 SUPPORT VECTOR MACHINE
      • SVM is a supervised machine learning algorithm used for both regression and classification problems.
      • The objective is to find a hyperplane in an N-dimensional space that separates the data points of the two classes.
      • Hyperplanes
          • Hyperplanes are decision boundaries that help segregate the data points.
          • The dimension of the hyperplane depends on the number of features.
  • 59. Support Vectors
      • These are the data points closest to the hyperplane; they influence its position and orientation.
      • They are used to maximize the margin of the classifier.
      • They are considered the critical elements of a dataset.
  • 60. Kernel Technique
      • Used when non-linear decision boundaries are needed.
      • The data is mapped into a higher-dimensional space, where the separating boundary is no longer a line but a plane.
      • Since this is a non-linear classification problem, the kernel used here is the Radial Basis Function (rbf).
      • It helps in segregating data that is not linearly separable.
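The Radial Basis Function kernel measures similarity between two data points, which is what lets an SVM draw non-linear boundaries. The sketch below computes the kernel value directly; the `gamma` value and the sample points are assumptions chosen for illustration.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical two-feature data points.
a = [0.0, 0.0]
b = [1.0, 1.0]
c = [3.0, 3.0]

print(rbf_kernel(a, a))  # identical points give the maximum similarity
print(rbf_kernel(a, b))  # nearby points give a value between 0 and 1
print(rbf_kernel(a, c))  # distant points give a value close to 0
```

Larger values of `gamma` make the similarity fall off faster with distance, which produces a more flexible decision boundary.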
  • 62. Testing SVM Model
      • Confusion Matrix
  • 63. AUC Score
      • Plotting ROC Curve
  • 64. ROC Curve For SVM
      • Using the SVM algorithm, we obtained an accuracy score of 79% and a roc_auc score of 0.77.
  • 65. 4.4 XGBOOST
      • XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework.
      • XGBoost belongs to a family of boosting algorithms that convert weak learners into strong learners.
      • It is a sequential process: trees are grown one after the other, each using information from the previously grown trees, so that the errors of the previous model are corrected by the next predictor.
      • Advantages of XGBoost:
          • Regularization
          • Parallel Processing
          • High Flexibility
          • Handling Missing Values
          • Tree Pruning
          • Built-in Cross-Validation
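The sequential idea, where each new predictor corrects the errors of the previous ones, can be illustrated with a plain gradient boosting sketch on one-split trees (stumps) with squared error. This is not XGBoost itself, which adds regularization, pruning and other refinements; the data, learning rate and number of rounds are assumptions.

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the current residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=20, lr=0.5):
    """Add stumps one at a time, each fitted to the errors left by the previous ones."""
    preds = [sum(ys) / len(ys)] * len(xs)  # start from the mean prediction
    history = [mse(ys, preds)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [p + lr * (left_mean if x <= t else right_mean) for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))
    return preds, history


# Hypothetical single feature and binary attrition labels.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

preds, history = boost(xs, ys)
print(round(history[0], 4), round(history[-1], 4))
```

The training error recorded in `history` shrinks from round to round, which is the error-correcting behaviour described above.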
  • 67. Testing the Model
      • Confusion Matrix
  • 68. AUC Score
      • Plotting ROC Curve
  • 69. ROC Curve For XGBoost Model
      • Using the XGBoost algorithm, we obtained an accuracy score of 82% and a roc_auc score of 0.81.
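  • 70. 4.5 COMPARISON OF MODELS
      • Reported results for each model:
          • Logistic Regression: accuracy 79%, roc_auc 0.77
          • Random Forest: accuracy 79%, roc_auc 0.76
          • Support Vector Machine: accuracy 79%, roc_auc 0.77
          • XGBoost: accuracy 82%, roc_auc 0.81
      • The comparison shows that XGBoost outperforms all the other models.
      • Based on these results, we conclude that XGBoost is the best of these models for predicting future employee attrition for this company.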
  • 72. KEY FINDINGS
      • The dataset does not contain any missing values or any redundant features.
      • The strongest positive correlations with the target variable are: distance from home, job satisfaction, marital status, overtime, and business travel.
      • The strongest negative correlations with the target variable are: performance rating and training times last year.
  • 74. RECOMMENDATIONS
      • Transportation should be provided to employees living in the same area; otherwise, a transportation allowance should be provided.
      • Plan and allocate projects in a way that avoids the use of overtime.
      • Employees who reach their two-year anniversary should be identified as potentially having a higher risk of leaving.
      • Gather information on industry benchmarks to determine whether the company is offering competitive wages.