
PREDICTION OF EMPLOYEE ATTRITION

USING MACHINE LEARNING


N.Ganesh 1, T.Yamini 2, P.Harika 3, M.Thrisha 4 and Mr.B.RamaRao 5
1,2,3,4 Students, Department of Information Technology and Engineering, Narasaraopet Engineering College
5 Faculty, Department of Information Technology and Engineering, Tirumala Engineering College, Narasaraopet
1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected]

Abstract- In today's IT world, a major concern is the employee attrition rate. The attrition rate can be defined as the percentage of employees who leave an organization. The aim of this project is to analyse whether a particular employee will continue in the organization or not. An employee may leave either by individual choice or because of organizational pressure. To predict the attrition rate we have used different machine learning techniques. The steps are dataset collection, pre-processing the data, training models using machine learning classification models such as Random Forest and the decision tree classifier, and result analysis. The results are evaluated using the accuracy score and the confusion matrix. The Random Forest algorithm gives the best accuracy, i.e. 85%, compared to the decision tree. This work will help organizations better understand the causes of attrition.

Keywords- Attrition, classification models, random forest, SVM, decision tree classifier

I. INTRODUCTION

These days, data is produced at an exponential pace. This data has been useful in gaining knowledge and spreading awareness about any company or group. Before modelling data we have to pre-process it, with the goal of drawing insightful conclusions and recovering pertinent data to make wise decisions. Machine learning is a way of making a computer produce correct predictions using historical information.

Employees play a major role in any company, so losing effective employees can have a negative impact on the business in a number of ways. Employee attrition has a number of negative effects, including increased costs for hiring and training new workers [1]. It also affects the well-being of the remaining employees in the organization. This work proceeds in three stages. Dataset collection is the first stage and is discussed next. The second stage is data pre-processing; this stage is crucial for any machine learning project before building a model [9]. The dataset contains inconsistent data, imbalanced class labels and unwanted attributes [8], all of which lead to a poorly constructed model. We must find the important attributes that impact the target attribute, and for this we compute feature importance on all attributes. The third stage is model training, where we pass the cleaned, more consistent data to different classification models.

II. LITERATURE SURVEY

Many studies on attrition prediction analysis have been made in the literature, with the major focus on predicting employee attrition. Researchers have applied machine learning classification models such as logistic regression, random forests, support vector machines, and others to analyze the attributes that impact the attrition rate. For instance, Srivastava et al. [1] presented a framework that predicts employee churn by analyzing the behaviors and attributes of employees with the help of machine learning techniques. Setiawan et al. [5] through their work found variables that have a major impact on employee attrition. Qasem A. Al-Radaideh and Eman Al Nagi utilized data mining techniques to construct a classification model that can anticipate employees' performance. They implemented the CRISP-DM data mining methodology in their research and employed the decision tree as the primary data mining tool to build the classification model; multiple classification rules were created as a result. The generated model was validated through a series of experiments using actual data obtained from various businesses. The purpose of the model is to forecast the performance of new job applicants.

III. DATASET COLLECTION

The "IBM HR Analytics Employee Attrition & Performance" dataset was acquired from Kaggle, a website that provides datasets and serves as a venue for data-science contests [13]. There are 35 attributes and 1470 entries in this collection. The data categories include independent factors like "Age," "Daily Rate," "Education Field," "Number of companies worked," etc.; in this study, "Attrition" is regarded as the dependent variable. Two class labels, "Yes" and "No," make up the "Attrition" data field.
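As a concrete illustration, the dataset described above can be read and inspected with pandas. This is a minimal sketch: the file name in the comment is the one the Kaggle download typically uses (an assumption), and a tiny inline sample stands in for the real 1470-row file here.

```python
import pandas as pd

# In the real project the Kaggle CSV (ref. [13]) would be read, e.g.:
#   df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")  # hypothetical local path
# A small inline sample of the same columns stands in for it below.
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "DailyRate": [1102, 279, 1373, 1392],
    "EducationField": ["Life Sciences", "Life Sciences", "Other", "Life Sciences"],
    "Attrition": ["Yes", "No", "Yes", "No"],  # the dependent variable
})

print(df.shape)                        # (1470, 35) for the full dataset
print(df["Attrition"].value_counts())  # two classes: "Yes" / "No"
```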

IV. DATA PRE-PROCESSING

A) LIBRARIES USED:

1) Libraries for Import: We use the following powerful and useful tools for the analysis and prediction of the attrition rate:

a) NumPy: It ranks among the most significant Python tools for computational mathematics and science.

b) Pandas: A tool made for quick and simple data frame processing.

c) Matplotlib: A Python package that produces complex graphs and charts like bar charts, pie charts, and more.

d) Scikit-Learn: The scikit-learn package provides a variety of supervised and unsupervised machine learning methods. The main goal of machine learning tools is data modeling.

2) Read Dataset: Read the dataset in .csv format using the pandas function read_csv().

3) Create dataset as Data Frame: Create a data frame from the read dataset object. This data frame will be used in the further pre-processing steps.

B) DATA PREPROCESSING

Pre-processing means cleaning data, normalizing datasets and applying changes to the data. These steps are performed to get the dataset into a state that enables analysis in the further phases [1]. In data preprocessing the following steps were performed:

1) Investigate Dataset Properties: The goal of data exploration was to comprehend the connections between the factors and to examine the issue at hand [1]. This step is useful for spotting common dataset problems like null values, outliers, redundancies, etc. Figure 1 depicts the columns of the dataset and their data types.

FIG: 1 ATTRIBUTES AND DATA TYPES

2) Data preparation: This includes the procedures for exploring, pre-processing, and configuring data before data modeling. It was carried out in order to familiarize ourselves with our data and learn more about it. It required converting the data into a structure that would make further research easier. This is usually the stage in the analytics lifecycle that requires the most effort and iterations [1]. The following are the main processes taken for data preparation:

a) Feature Reduction: This phase was important in deciding which features in the dataset should be kept and which should be transformed or removed, in order to determine which attributes in the data will be helpful for analysis in the later stages. The decision as to which trait is important for the attrition forecast and which is not was made here. The following are examples of characteristics on whose basis attributes were excluded from further analysis:

i) Attributes with a single unique value: The following attributes take the same value for every record: "Employee count" is "1" for each employee; "Standard working hours" gives the number of an employee's standard working hours, which is "80" for each entry; and "Over 18 yrs of age" confirms whether an employee meets the age requirement (to be over 18), which is "Yes" for every entry. All of the above attributes have only one unique value, so we drop them from the dataset.

ii) Data cleaning: To guarantee better data quality, abnormalities, usually missing values, duplicate data, and outliers, are removed. In our dataset there are no missing values or outliers.

iii) Categorical to Numerical: Since categorical variables are not accepted as input by standard libraries, these values must be transformed into numeric form. As seen in Figure 2, this was accomplished by using the Label Encoder method to convert nominal categorical variables into numerical labels. The range of a numerical label is always 0 to n_classes-1.

Attrition    Encoded
Yes          1
No           0
Yes          1

FIG: 2 ATTRITION ATTRIBUTE LABEL ENCODING
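The feature-reduction and encoding steps above can be sketched with pandas and scikit-learn. This is a minimal sketch on a toy frame mimicking the attributes discussed, not the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the single-valued attributes discussed above (values illustrative).
df = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],       # constant -> dropped
    "StandardHours": [80, 80, 80, 80],   # constant -> dropped
    "Over18": ["Y", "Y", "Y", "Y"],      # constant -> dropped
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# i) drop attributes that have only one unique value
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)

# iii) label-encode the categorical target; LabelEncoder sorts classes
# alphabetically, so "No" -> 0 and "Yes" -> 1 as in Figure 2
df["Attrition"] = LabelEncoder().fit_transform(df["Attrition"])
```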

iv) Dataset balancing: In the given dataset, there are more records with the label "Attrition" set to "0" than records with it set to "1", causing an imbalance. Figure 3 is a bar graph that displays the number of records in the collection for each label.

FIG: 3 BAR PLOT FOR TARGET ATTRIBUTE DISTRIBUTION

Using the Synthetic Minority Oversampling Technique (SMOTE), entries for the class with the lower total were artificially generated. SMOTE, a method for oversampling the minority class, was chosen over undersampling because the latter could lead to the removal of important data [12].

C) VISUALIZATION

This process provides valuable insights into the dataset and helps to distinguish important features from irrelevant ones. Overall, visualization is a crucial step in data analysis that enables us to quickly gain a high-level understanding of the data and make informed decisions about feature selection.

1) Attrition vs Business Travel: From Figure 4 we can clearly see that Non-Travel employees have a low attrition rate. Put another way, employees who travel from place to place for business purposes have a high attrition rate.

FIG: 4 BAR PLOT REPRESENTATION FOR BUSINESS TRAVEL

2) Attrition on the basis of gender:

FIG: 5 BAR CHART REPRESENTATION FOR 'GENDER'

Figure 5 shows that the turnover rate is not significantly influenced by the employee's gender: in each instance the turnover rate stays about the same. This demonstrates that Gender is not a characteristic that should be considered for inclusion in future attrition forecast methods. These graphic representations make feature selection and reduction much more understandable and simple.

V. FEATURE IMPORTANCE AND TRAIN MODEL

A) Divide dataset into Train and Test: To prepare the data for machine learning, the DataFrame was divided into two subsets: Train and Test. The Train set was used to train the machine learning algorithm, and the knowledge gained was used to predict the required attribute for the Test set. It is important to have a larger Train set than Test set, as this helps the machine learn better from the dataset. Typically the train data should be around 70-85% of the entire dataset. In our case the train data consists of 85% of the DataFrame, i.e. 1249 rows, and the remaining 15%, or 221 rows, form the test data.

B) Feature Importance: In machine learning, feature importance refers to the process of determining the relative importance of different input variables [7], or features, in predicting the output of a model.

Feature importance is useful because it helps to identify which features are most relevant to the problem being solved, and which features can be ignored or removed to simplify the model without sacrificing accuracy. This information can be used to optimize the performance of the model by focusing on the most important features and reducing the dimensionality of the data.
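The split and feature-importance steps described in this section can be sketched with scikit-learn. Synthetic data stands in for the processed HR frame (an assumption), and the SMOTE call is indicated only in a comment, since it comes from the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded HR features (shapes are illustrative).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# 85/15 split, as used in the paper (1249 train rows vs 221 test rows).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Class balancing would be applied to the training set only, e.g. with
# imblearn: X_train, y_train = SMOTE().fit_resample(X_train, y_train)

# Feature importance scores from a fitted random forest.
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
importances = rf.feature_importances_
print(importances)  # one score per feature; the scores sum to 1
```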

C) Machine Learning Models for prediction:

After preparing the data, the next step is an iterative process that aims to improve the accuracy of the models. Several classification models can be used for this purpose:

1) Decision Tree Classifier: This method is suitable for multistage decision-making and breaks down complex decisions into elementary ones for easy interpretation [2][3].

2) Support Vector Machine (SVM): This approach can be utilized for both classification and regression tasks. It involves constructing a hyperplane with maximum margin in a transformed input space to separate the different classes of examples. The goal is to ensure that the hyperplane is as far as possible from the nearest correctly classified examples, which results in a well-separated and accurately classified dataset [4].

3) Logistic Regression: One of the simplest supervised machine learning algorithms. The logistic regression technique employs a linear model to convert the predictor variables into a probability value between 0 and 1. The parameters of the logistic function are estimated by the model using a technique known as maximum likelihood estimation, which involves determining the parameter values that maximize the probability of observing the data [14][6].

4) Random Forest: This method is an ensemble learning algorithm that generates multiple sub decision trees and merges them to produce a more accurate and stable prediction [9].

In this project we train models using the Random Forest algorithm, logistic regression [6], SVM and the Decision Tree. Among these algorithms, random forest gives the best accuracy compared to the other classification models.

D) Result Analysis:

When evaluating a machine learning model, it is important to use appropriate metrics to measure its performance. Two commonly used metrics in machine learning are:

1) Accuracy: This metric measures the proportion of correct predictions made by the model over the total number of predictions. It is calculated by dividing the number of correct predictions by the total number of predictions made.

2) Confusion Matrix: A matrix representation of the TP, TN, FP and FN values. Using this matrix we can also compute the accuracy score as (TP+TN)/(TN+TP+FN+FP) [10][11].

Among all the classification algorithms, we observe that the Random Forest algorithm gives the best accuracy score and also predicts accurately on unknown data.

FIG: 6 BAR GRAPH FOR TEST ACCURACIES

Figure 6 presents the test accuracies of the algorithms. From the figure we clearly observe that the random forest algorithm gives the best accuracy compared to the others.

FIG: 7 CONFUSION MATRIX FOR RANDOM FOREST

Figure 7 shows how correctly the random forest algorithm predicts the target class.

VI. CONCLUSION

After assessing the execution of the four classification models, a significant finding was that when feature reduction is appropriately conducted before prediction, the accuracy of the classification models is consistently better than classification performed without it. In particular, the Random Forest classifier with feature reduction achieved an accuracy score of 85.3%, while the Decision Tree classifier achieved 83%. The Random Forest model gives the best classification of the true-positive and true-negative data. The methods described in this paper for analyzing and categorizing data can form a basis for improving data-driven decision-making processes. These techniques can unlock new insights from data and help organizations improve their operations. Implementation of these methods can also contribute to a positive work culture and improve an organization's reputation in its industry.
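The model comparison and evaluation described in Sections V.C and V.D can be reproduced in outline with scikit-learn. This is a sketch on synthetic data standing in for the processed HR frame, so the scores will not match the 85% reported above; the accuracy identity computed from the confusion matrix is checked inline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed, encoded HR data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# The four classification models compared in the paper.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = accuracy_score(y_test, pred)
    # accuracy = (TP + TN) / (TP + TN + FP + FN), from the confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    assert abs(scores[name] - (tp + tn) / (tp + tn + fp + fn)) < 1e-9
```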

REFERENCES
[1] Srivastava, Devesh Kumar, and Priyanka Nair. "Employee
attrition analysis using predictive techniques." International
Conference on Information and Communication
Technology for Intelligent Systems. Springer, Cham, 2017.
[2] S. S. Gavankar and S. D. Sawarkar, "Eager decision tree,"
2017 2nd International Conference for Convergence in
Technology (I2CT), Mumbai, 2017, pp. 837-840.
[3] Safavian, S. R., Landgrebe, D., "A survey of decision tree
    classifier methodology", IEEE Transactions on Systems,
    Man, and Cybernetics, Vol. 21, No. 3, May-June 1991.
[4] Shmilovici A. (2009) Support Vector Machines. In: Maimon
O., Rokach L. (eds) Data Mining and Knowledge Discovery
Handbook. Springer, Boston.
[5] Setiawan, I., et al. "HR analytics: Employee attrition analysis
using logistic regression." IOP Conference Series: Materials
Science and Engineering. Vol. 830. No. 3. IOP Publishing,
2020
[6] Schober, Patrick MD, PhD, MMedStat*; Vetter, Thomas R.
MD, MPH†. Logistic Regression in Medical Research.
Anesthesia & Analgesia 132(2):p 365-366, February 2021.
| DOI: 10.1213/ANE.0000000000005247
[7] Jayalekshmi J, Tessy Mathew, "Facial Expression
    Recognition and Emotion Classification System for
    Sentiment Analysis", 2017 International Conference on
    Networks & Advances in Computational Technologies
    (NetACT), 20-22 July 2017, Trivandrum.
[8] Isabelle Guyon, Andre Elisseeff, “An Introduction to Variable
and Feature Selection”, Journal of Machine Learning
Research 3 (2003) 1157-1182.
[9] Ilan Reinstein, "Random Forests, Explained", kdnuggets.com,
    October 2017 [Online]. Available:
    https://www.kdnuggets.com/2017/10/random-forests-explained.html
[10] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
[11] http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
[12] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W.
Philip Kegelmeyer, “SMOTE: Synthetic Minority Over-
sampling Technique”, Journal of Artificial Intelligence
Research 16 (2002), 321 – 357
[13] Pavan Subhash, "IBM HR Analytics Employee Attrition &
    Performance", www.kaggle.com, 2016 [Online]. Available:
    https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
[14] Sperandei S. Understanding logistic regression analysis.
Biochem Med (Zagreb). 2014 Feb 15;24(1):12-8. doi:
10.11613/BM.2014.003. PMID: 24627710; PMCID:
PMC3936971.
